Learning Balanced Mixtures of Discrete Distributions 
with Small Sample 






Shuheng Zhou szhou@cs.cmu.edu
Computer Science Department 
Carnegie Mellon University 
Pittsburgh, PA 15213, USA 


Abstract 

We study the problem of partitioning a small sample of $n$ individuals from a mixture of $k$ product
distributions over a Boolean cube $\{0, 1\}^K$ according to their distributions. Each distribution is described
by a vector of allele frequencies in $\mathbb{R}^K$. Given two distributions, we use $\gamma$ to denote the average
$\ell_2^2$ distance in frequencies across $K$ dimensions, which measures the statistical divergence between them.
We study the case assuming that bits are independently distributed across $K$ dimensions. This work
demonstrates that, for a balanced input instance for $k = 2$, a certain graph-based optimization function
returns the correct partition with high probability, where a weighted graph $G$ is formed over $n$ individuals,
whose pairwise Hamming distances between their corresponding bit vectors define the edge weights, so long as
$K = \Omega(\ln n/\gamma)$ and $Kn = \tilde{\Omega}(\ln n/\gamma^2)$. The function computes a maximum-weight
balanced cut of $G$, where the weight of a cut is the sum of the weights across all edges in the cut. This
result demonstrates a nice property in the high-dimensional feature space: one can trade off the number of
features that are required with the size of the sample to accomplish certain tasks like clustering.

Keywords: Mixture of Discrete Distributions, Graph-based Clustering, Max-Cut 

1. Introduction 

We explore a type of classification problem that arises in the context of computational biology. The problem
is that we are given a small sample of size $n$, e.g., DNA of $n$ individuals, each described by the values
of $K$ features or markers, e.g., SNPs (Single Nucleotide Polymorphisms), where $n \ll K$. Features have
slightly different frequencies depending on which population the individual belongs to, and are assumed to
be independent of each other. Given the population of origin of an individual, the genotype (represented
as a bit vector in this paper) can be reasonably assumed to be generated by drawing alleles independently
from the appropriate distribution. The objective we consider is to minimize the number of features $K$, and
thus total data size $D = nK$, to correctly classify the individuals in the sample according to their population
of origin, given any $n$. We describe $K$ and $nK$ as a function of the "average quality" $\gamma$ of the features.
Throughout the paper, we use $p_j^i$ and $x_j^i$ as shorthands for $p_j^{(i)}$ and $x_j^{(i)}$ respectively. We first
describe a general mixture model that we use in this paper. The same model was previously used in Zhou (2006)
and Blum et al. (2007).



Statistical Model: We have $k$ probability spaces $\Omega_1, \ldots, \Omega_k$ over the set $\{0, 1\}^K$. Further, the
components (features) of $z \in \Omega_t$ are independent and $\Pr_{\Omega_t}[z_i = 1] = p_t^i$ ($1 \le t \le k$, $1 \le i \le K$).
Hence, the probability spaces $\Omega_1, \ldots, \Omega_k$ comprise the distribution of the features for each of the $k$
populations. Moreover, the input of the algorithm consists of a collection (mixture) of $n = \sum_{t=1}^{k} N_t$
unlabeled samples, $N_t$ points from $\Omega_t$, and the algorithm is to determine for each data point from which of
$\Omega_1, \ldots, \Omega_k$ it was chosen. In general we do not assume that $N_1, \ldots, N_k$ are revealed to the algorithm;
but we do require some bounds on their relative sizes. An important parameter of the probability ensemble
$\Omega_1, \ldots, \Omega_k$ is the measure of divergence
$\gamma = \min_{1 \le s < t \le k} \frac{1}{K} \sum_{i=1}^{K} (p_s^i - p_t^i)^2$
between any two distributions. Note that $\sqrt{K\gamma}$ provides a lower bound on the Euclidean distance between
the means of any two distributions and represents their separation.
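
The sampling model above can be sketched in a few lines of Python. This is only an illustration: the helper names `sample_mixture` and `gamma` are hypothetical, and $\gamma$ is taken here to be the smallest average squared difference in frequencies between two centers, which is an assumption consistent with $\|p_1 - p_2\|_2^2 = K\gamma$ for $k = 2$. Labels are returned for evaluation only; a clustering algorithm would not see them.

```python
import random

def sample_mixture(centers, sizes, seed=0):
    """Draw sizes[t] points from the product distribution with center centers[t].

    Returns (points, labels): points are K-bit tuples, labels are population indices.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for t, (p, n_t) in enumerate(zip(centers, sizes)):
        for _ in range(n_t):
            # Each coordinate is an independent Bernoulli(p[i]) bit.
            points.append(tuple(1 if rng.random() < p_i else 0 for p_i in p))
            labels.append(t)
    return points, labels

def gamma(centers):
    """Smallest average squared frequency difference over all pairs of centers."""
    K = len(centers[0])
    return min(
        sum((a - b) ** 2 for a, b in zip(centers[s], centers[t])) / K
        for s in range(len(centers))
        for t in range(s + 1, len(centers))
    )

# Two populations, K = 8 features, three individuals from each.
centers = [[0.6] * 8, [0.4] * 8]
points, labels = sample_mixture(centers, sizes=[3, 3])
```

With these centers every feature differs by 0.2 in frequency, so `gamma(centers)` is about 0.04.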

Further, let $N = n/k$ (so if the populations were balanced we would have $N$ of each type). This paper
proves the following theorem, which gives a sufficient condition for a balanced ($N_1 = N_2$) input instance
when $k = 2$.



Theorem 1 (Zhou, 2006, Chapter 9) Assume $N_1 = N_2 = N$. If $K = \Omega(\ln N/\gamma)$ and
$KN = \Omega(\ln N \log\log N/\gamma^2)$, then with probability $1 - 1/\mathrm{poly}(N)$, among all balanced cuts in the
complete graph formed among $2N$ sample points, the maximum weight cut corresponds to the partition of the
$2N$ points according to their distributions. Here the weight of a cut is the sum of weights across all edges in
the cut, and the edge weight equals the Hamming distance between the bit vectors of the two endpoints.
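
To make the statement concrete, the following sketch searches all balanced cuts of the Hamming-weighted complete graph by brute force on a tiny, well-separated instance. It is not an algorithm from the paper (finding a max-cut is intractable in general); the function name `max_weight_balanced_cut` and the parameters ($N = 4$, $K = 400$, frequencies 0.8 versus 0.2) are illustrative assumptions.

```python
import itertools
import random

def hamming(x, y):
    """Hamming distance between two equal-length bit vectors."""
    return sum(a != b for a, b in zip(x, y))

def max_weight_balanced_cut(points):
    """Exhaustively search balanced cuts of the complete Hamming-weighted graph."""
    n = len(points)
    assert n % 2 == 0
    best_weight, best_side = -1, None
    # Fix point 0 on one side so that each cut is enumerated exactly once.
    for rest in itertools.combinations(range(1, n), n // 2 - 1):
        side = {0, *rest}
        other = [j for j in range(n) if j not in side]
        weight = sum(hamming(points[i], points[j]) for i in side for j in other)
        if weight > best_weight:
            best_weight, best_side = weight, frozenset(side)
    return best_side, best_weight

rng = random.Random(1)
N, K = 4, 400
p1, p2 = [0.8] * K, [0.2] * K
pop1 = [tuple(int(rng.random() < p) for p in p1) for _ in range(N)]
pop2 = [tuple(int(rng.random() < p) for p in p2) for _ in range(N)]

# Points 0..N-1 come from the first population, N..2N-1 from the second.
side, weight = max_weight_balanced_cut(pop1 + pop2)
```

On this instance the side containing point 0 should consist of exactly the first population, since cross-population pairs differ on far more coordinates than same-population pairs.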

Variants of the above theorem, based on a model that allows two random draws at each dimension for all
points, are given in Chaudhuri et al. (2007, Theorem 3.1) and Zhou (2006, Chapter 8). The cleverness
there is the construction of a diploid score at each dimension, given any pair of individuals, under the
assumption that two random bits can be drawn from the same distribution at each dimension. In expectation,
diploid scores are higher among pairs from different groups than for pairs in the same group across all $K$
dimensions. In addition, Chaudhuri et al. (2007, Lemma 2.2) shows that when $K \ge \Omega(\ln n/\gamma^2)$, given
two bits from each dimension, one can always classify for any size of $n$, for unbalanced cases with any
number of mixtures, using essentially connected-component-based algorithms, given the weighted graph as
described in Theorem 1.

The key contribution of this paper is to show new ideas that we use to accomplish the goal of clustering
with the same number of features, while requiring only one random bit at each dimension. While some ideas
and proofs for Theorem 1 in Section 4 have appeared in Chaudhuri et al. (2007), modifications for handling
a single bit at each dimension are ubiquitous throughout the proof. Hence we include the complete proof in
this paper to give a self-contained exposition.



Finding a max-cut is computationally intractable; a hill-climbing algorithm was given in Chaudhuri et al.
(2007) to partition a balanced mixture, with a stronger requirement on $K$, given any $n$, as the middle green
curve in Figure 1 shows. Two simpler algorithms using spectral techniques were constructed in Blum et al.
(2007), attempting to reproduce the conditions above. Both spectral algorithms in Blum et al. (2007) achieve the
bound established by Theorem 1 without requiring the input instances to be balanced, and work for cases
when $k > 2$ is a constant; however, they require $n = \Omega(1/\gamma)$, even when $k = 2$ and the input instance is
balanced, as the vertical line in Figure 1 shows. Note that when $N = \Omega(\log\log N/\gamma)$, i.e., when we have
enough samples from each distribution, $K = \Omega(\ln N/\gamma)$ becomes the only requirement in Theorem 1. Exploring
the tradeoffs between $n$ and $K$, when $n$ is small, as in Theorem 1, in algorithmic design is of both theoretical
interest and practical value.



1.1 Related Work 



In a seminal paper, Pritchard et al. (2000) presented a model-based clustering method to separate
populations using genotype data. They assume that observations from each cluster are random from some
parametric model. Inference for the parameters corresponding to each population is done jointly with inference
for the cluster membership of each individual, and $k$ in the mixture, using Bayesian methods.

Figure 1: This figure illustrates results from three papers. Top and middle curves are algorithmic results
from Chaudhuri et al. (2007). The bottom red curve shows non-algorithmic results from this paper with a
single random draw and Chaudhuri et al. (2007) with two random draws at each dimension. For
$n > \Omega(1/\gamma)$, to the right of the vertical dashed line, spectral algorithms (Blum et al., 2007)
achieve bounds given in the red curve. The curves are generated using a biased distribution in
terms of the $\ell_1$ distances in allele frequencies: for 9/10 of features, ...; and for
the rest, it is 0.1265; for this mixture, $\gamma = 0.0016$.



Applying spectral techniques by McSherry (2001) on graph partitioning, and an extension due to Coja-Oghlan
(2006) from their original setting on graphs to the asymmetric $n \times K$ matrix of individuals/features yields a
polynomial time algorithm for this problem when $k$ is given as a constant, as analyzed by Blum et al. (2007).
For $k = 2$, an extremely simple algorithm based on examining values in the top two left singular vectors of
the random matrix can cluster samples efficiently. However, spectral techniques require a lower bound on
the sample size $n$ to be at least $1/\gamma$, as shown in Figure 1.

There are two streams of related work in the learning community. The first stream is the recent progress
in learning from the point of view of clustering: given samples drawn from a mixture of well-separated
Gaussians (component distributions), one aims to classify each sample according to which component
distribution it comes from, as studied in Dasgupta (1999); Dasgupta and Schulman (2000); Arora and Kannan
(2001); Vempala and Wang (2002); Achlioptas and McSherry (2005); Kannan et al. (2005); Dasgupta et al.
(2005). This framework has been extended to more general distributions such as log-concave distributions
by Achlioptas and McSherry (2005); Kannan et al. (2005), and heavy-tailed distributions by Dasgupta et al.
(2005), as well as to more than two populations. These results focus mainly on reducing the requirement on
the separations between any two centers $P_1$ and $P_2$. In contrast, we focus on the sample size $D$. This is
motivated by previous results (Chaudhuri et al., 2007; Zhou, 2006) stating that by acquiring enough attributes
along the same set of dimensions from each component distribution, with high probability, we can correctly
classify every individual.

While our aim is different from those results, where $n > K$ is almost universal and we focus on cases
$K > n$, we do have one common axis for comparison, the $\ell_2$-distance between any two centers of the
distributions. In earlier works of Dasgupta and Schulman (2000); Arora and Kannan (2001), the separation
requirement depended on the number of dimensions of each distribution; this has recently been reduced
to be independent of $K$, the dimensionality of the distribution, for certain classes of distributions
in Achlioptas and McSherry (2005); Kannan et al. (2005). This is comparable to our requirement in
Theorem 1 and that of Blum et al. (2007) for discrete distributions. For example, according to Theorem 7
in Achlioptas and McSherry (2005), in order to separate the mixture of two Gaussians,
$\|P_1 - P_2\|_2 = \Omega\left(\frac{\sigma}{\sqrt{\omega}} + \sigma\sqrt{\log n}\right)$ is required.



Besides Gaussian and log-concave distributions, a general theorem in Achlioptas and McSherry (2005, Theorem 6)
is derived that in principle also applies to mixtures of discrete distributions. The key difficulty of
applying their theorem directly to our scenario is that it relies on a concentration property of the
distribution (Achlioptas and McSherry, 2005, Eq. (10)) that need not hold in our case. In addition, once the
distance between any two centers is fixed, that is, once $\gamma$ is fixed in the discrete distribution, the sample size
$n$ in their algorithms is always larger than $\Omega\left(\frac{K}{\gamma}\log^5 K\right)$ (Achlioptas and McSherry, 2005; Kannan et al.,
2005) for log-concave distributions (in fact, in Theorem 3 of Kannan et al. (2005), they discard at least
this many individuals in order to correctly classify the rest in the sample), and larger than $\Omega\left(\frac{K}{\gamma}\right)$ for
Gaussians (Achlioptas and McSherry, 2005), whereas $n < K$ always holds when $n < \frac{1}{\gamma}$ in the present paper.

The second stream of work is under the PAC-learning framework, where given a sample generated from
some target distribution $Z$, the goal is to output a distribution $Z'$ that is close to $Z$ in Kullback-Leibler
divergence $KL(Z \| Z')$, where $Z$ is a mixture of product distributions over discrete domains or Gaussians
(Kearns et al., 1994; Freund and Mansour, 1999; Cryan, 1999; Cryan et al., 2002; Mossel and Roch,
2005; Feldman et al., 2005, 2006). They do not require a minimal distance between any two distributions,
but they do not aim to classify every sample point correctly either, and in general require much more data.



2. Preliminaries and Definitions 

Let us first formally define a product distribution over a Boolean cube $\{0, 1\}^K$.

Definition 2 A product distribution $D_m$, $\forall m = 1, 2$, over a Boolean cube $\{0, 1\}^K$ is characterized by its
expected value $p_m = (p_m^1, \ldots, p_m^K) \in [0, 1]^K$, which we refer to as the center of $D_m$.

We then restate our problem as a fundamental problem of learning mixtures of two product distributions
over discrete domains, in particular, over the $K$-dimensional Boolean cube $\{0, 1\}^K$, where $K$ is a variable
whose value we need to resolve. We use $X = x = (x^1, x^2, \ldots, x^K)$ to represent a random $K$-bit vector,
given a set of $K$ attributes. Sometimes we also use $x_j^i$ to represent the $i$-th coordinate of point $X_j$.

Definition 3 A random vector $x$ from the distribution $D_m$, which we denote as $x \sim D_m$ or $x \sim p_m$, where
$p_m$ is the center of $D_m$, is generated by independently selecting each coordinate $x^i$ to be 1 with probability
$p_m^i$, and thus $\forall i, \forall m$, $E_{x \sim D_m}[x^i] = p_m^i$.

We next use the inner product of two $K$-dimensional vectors $x$ and $y$ as the score between $X$ and $Y$, as in
Definition 4, and define a complete graph, where nodes are sample points and each edge weight is the score
between the two endpoints.

Definition 4 $\mathrm{score}(X, Y) = \langle x, y \rangle = \sum_{i=1}^{K} x^i y^i$.

Definition 5 Let $X$ be a sample point from distribution $D_1$ and $Y$ be a sample point from $D_2$. Let $X'$, $Y'$ be
points randomly drawn from $D_1$ and $D_2$ respectively,

$$\mathrm{diff}(X) = E_{X' \sim D_1}[\mathrm{score}(X, X')] - E_{Y' \sim D_2}[\mathrm{score}(X, Y')],$$

$$\mathrm{diff}(Y) = E_{Y' \sim D_2}[\mathrm{score}(Y, Y')] - E_{X' \sim D_1}[\mathrm{score}(Y, X')],$$

where expectations are taken over all possible realizations of $X'$, $Y'$ respectively.

3. The Approach

Our goal is to show that the perfect partition $T = (P_1, P_2)$ is the minimum cut (min-cut) in terms of
score among all balanced cuts $(S, \bar{S})$, both in expectation and with high probability. Let us first define these
objects formally. In this complete graph, let $P_1$ represent the set of points $X_1, X_2, \ldots, X_N$ from a product
distribution $D_1$, and $P_2$ represent the set of points $Y_1, Y_2, \ldots, Y_N$ from a product distribution $D_2$.

Definition 6 Consider a balanced cut $(S, \bar{S})$, as in Figure 2, where $L \in [1, N/2]$ is the number of nodes
that have been swapped from one side of $T$ to the other. Let $S = \{X_i \in P_1, i = 1, \ldots, N - L;\ V_j \in
P_2, j = 1, \ldots, L\}$, and $\bar{S} = \{Y_i \in P_2, i = 1, \ldots, N - L;\ U_j \in P_1, j = 1, \ldots, L\}$. Let

$$\mathrm{score}(S, \bar{S}) = \sum_{i=1}^{N-L} \sum_{j=1}^{N-L} \mathrm{score}(X_i, Y_j) + \sum_{i=1}^{L} \sum_{j=1}^{L} \mathrm{score}(U_i, V_j) + \sum_{i=1}^{N-L} \sum_{j=1}^{L} \left[\mathrm{score}(X_i, U_j) + \mathrm{score}(Y_i, V_j)\right],$$

which defines $\mathrm{score}(T)$ when $L = 0$, i.e., $\mathrm{score}(T) = \sum_{i=1}^{N} \sum_{j=1}^{N} \mathrm{score}(X_i, Y_j)$.

It is easy to verify that in expectation, the perfect partition has the minimum score, i.e., for every balanced
$(S, \bar{S})$ other than $T$, $E[\mathrm{score}(T)] < E[\mathrm{score}(S, \bar{S})]$. The following theorem says that this is true with
high probability, given a large enough $K$.

Theorem 7 For a balanced mixture of two distributions, with probability $1 - 1/\mathrm{poly}(N)$, $\mathrm{score}(T) <
\mathrm{score}(S, \bar{S})$ for all other balanced cuts $(S, \bar{S})$, given $K = \Omega(\ln N/\gamma)$, $KN = \Omega(\ln N \log\log N/\gamma^2)$, and $N \ge 4$.

Corollary 8 Following steps in Theorem 7, one can show that if scores are replaced with pairwise Hamming
distances, i.e., $\forall X, Y$, $H(x, y) = \sum_{i=1}^{K} x^i \oplus y^i$, the max-cut will identify the perfect partition with high
probability, given the same order of number of attributes as stated in Theorem 7.
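
One way to see why minimum-score and maximum-Hamming cuts go together on balanced instances is the bitwise identity $x \oplus y = x + y - 2xy$, which gives $H(x, y) = |x| + |y| - 2\langle x, y \rangle$. In a balanced cut of $2N$ points every node lies on exactly $N$ cut edges, so the total Hamming weight equals $N \sum_z |z| - 2\,\mathrm{score}(S, \bar{S})$, with the first term independent of the cut. The sketch below checks this numerically over all balanced cuts of a small random instance; it is a sanity check added for illustration, not the argument used in the paper, and the helper names are hypothetical.

```python
import itertools
import random

def score(x, y):
    """Inner-product score between two bit vectors."""
    return sum(a * b for a, b in zip(x, y))

def hamming(x, y):
    """Hamming distance, written as a sum of bitwise XORs."""
    return sum(a ^ b for a, b in zip(x, y))

def cut_totals(points, side):
    """Total score and total Hamming weight across the cut (side, complement)."""
    other = [j for j in range(len(points)) if j not in side]
    s = sum(score(points[i], points[j]) for i in side for j in other)
    h = sum(hamming(points[i], points[j]) for i in side for j in other)
    return s, h

rng = random.Random(0)
N, K = 3, 20
points = [tuple(rng.randint(0, 1) for _ in range(K)) for _ in range(2 * N)]
total_ones = sum(sum(p) for p in points)

# Every balanced cut: Hamming weight = N * (number of ones) - 2 * score.
cuts = [set(c) for c in itertools.combinations(range(2 * N), N)]
identity_holds = all(
    cut_totals(points, side)[1] == N * total_ones - 2 * cut_totals(points, side)[0]
    for side in cuts
)
```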



The key technicality in this paper and Chaudhuri et al. (2007) is that, instead of showing that each balanced
cut $(S, \bar{S})$ has a score that is close to its expected value, we show that, for each balanced cut $(S, \bar{S})$, the
following random variable $\mathrm{diff}(T, (S, \bar{S}), L)$ as in (2), which captures the difference between the present
cut and the unique perfect partition $T$, stays close to its expected value, which is a positive number, given a
large enough $K$. Note that for a particular balanced cut $(S, \bar{S})$, $\mathrm{diff}(T, (S, \bar{S}), L) > 0$ immediately implies
that $\mathrm{score}(T) < \mathrm{score}(S, \bar{S})$. Figure 2 shows the edges whose weights contribute to:

$$\mathrm{diff}(T, (S, \bar{S}), L) = \mathrm{score}(S, \bar{S}) - \mathrm{score}(T) = \sum_{j=1}^{L} \sum_{i=1}^{N-L} \left[\mathrm{score}(V_j, Y_i) - \mathrm{score}(V_j, X_i) + \mathrm{score}(U_j, X_i) - \mathrm{score}(U_j, Y_i)\right]. \quad (2)$$

The random variable $\mathrm{diff}(T, (S, \bar{S}), L)$, $\forall N/2 \ge L \ge 1$, consists exactly of scores over the set of edges
that differ between those in $T$ and those in $(S, \bar{S})$, which is exactly the set of $4L(N - L)$ edges between
swapped nodes and unswapped nodes, among which $4(N - L)$ edges are shown in Figure 2. Hence we only
need to consider the influence of $2NK$ random bits over these two sets of edges contributing to (2), $\forall (S, \bar{S})$.
It is not hard to verify the following:

$$E[\mathrm{diff}(T, (S, \bar{S}), L)] = (N - L)L \left(E_{X \sim D_1}[\mathrm{diff}(X)] + E_{Y \sim D_2}[\mathrm{diff}(Y)]\right). \quad (3)$$

Figure 2: Edges that are different between a perfect partition $T$ and another balanced partition $(S, \bar{S})$, seen
only from $U_1 \sim p_1$ and $V_1 \sim p_2$; here $X_i \sim p_1$ and $Y_i \sim p_2$. Red dotted edges are in $T$ and green solid edges
are in $(S, \bar{S})$. In more detail, we refer to $X_i$ and $Y_i$, $\forall i \in [1, N - L]$, as unswapped nodes, as the
majority type in their side; we denote $V_j \in (S \cap P_2)$, $U_j \in (\bar{S} \cap P_1)$, $\forall j \in [1, L]$, as swapped
nodes, as the minority on their new side. In particular, for $(S, \bar{S})$, original cut (red dotted) edges
that belong to $T$ are replaced with (green solid) edges, which are the new edges that appear in
$(S, \bar{S})$; the set of common edges that belong to $T \cap (S, \bar{S})$ are not shown.

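
Equation (3) can be checked directly from the centers, since distinct sample points are independent and hence $E\langle x, y \rangle = \langle p_a, p_b \rangle$ for $x \sim p_a$, $y \sim p_b$. The sketch below computes $E[\mathrm{score}(S, \bar{S})] - E[\mathrm{score}(T)]$ for every $L$ and compares it with $(N - L)L\,\|p_1 - p_2\|_2^2$. The centers, $N$, and the function names are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def expected_cut_score(side_a, side_b, centers):
    """E[score] of a cut, given the population label of every node on each side.

    Distinct sample points are independent, so E<x, y> = <p_a, p_b>.
    """
    return sum(dot(centers[a], centers[b]) for a in side_a for b in side_b)

def expected_diff_T(N, L, centers):
    """E[score(S, S-bar)] - E[score(T)] when L nodes per side are swapped."""
    perfect = expected_cut_score([0] * N, [1] * N, centers)
    side_s = [0] * (N - L) + [1] * L        # X_1..X_{N-L} together with V_1..V_L
    side_sbar = [1] * (N - L) + [0] * L     # Y_1..Y_{N-L} together with U_1..U_L
    return expected_cut_score(side_s, side_sbar, centers) - perfect

centers = {0: [0.7, 0.2, 0.5, 0.9], 1: [0.4, 0.6, 0.5, 0.1]}
k_gamma = sum((a - b) ** 2 for a, b in zip(centers[0], centers[1]))  # ||p1 - p2||^2
N = 6
gaps = {L: expected_diff_T(N, L, centers) for L in range(0, N // 2 + 1)}
```

Each `gaps[L]` should agree with `(N - L) * L * k_gamma` up to floating-point error, which is zero at $L = 0$ and strictly positive for every other balanced cut whenever the centers differ.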
3.1 Key Idea in the One-bit Construction 



The difference from Chaudhuri et al. (2007) is that we require only a single bit at each dimension for score
in the present paper. The idea that makes an inner-product based score work is that although from an
individual's, e.g., $Y$'s, perspective, $\mathrm{diff}(Y)$ may not be significantly positive due to the definition of our score,
the sum of diffs over a pair of swapped nodes, e.g., $\mathrm{diff}(X) + \mathrm{diff}(Y)$ as in Figure 3, can be shown to be
positive with high probability, given $K = \Omega(\ln N/\gamma)$. Hence we prevent the sum $\mathrm{diff}(X) + \mathrm{diff}(Y)$
from deviating too much from its expected value $K\gamma$ (Proposition 13), by excluding those bad node events
(Definition 9), whose probability we bound in Lemmas 16 and 17.

Definition 9 (Bad Node Event) Let a bad node event $\mathcal{E}(Z)$ be the event that $\{\mathrm{diff}(Z) < E[\mathrm{diff}(Z)] -
K\gamma/4\}$, where $Z$ is a sample point in the mixture. Note this is an event in an individual probability space
$(\Omega_Z, F_Z, \Pr_Z)$, where $(\Omega_Z, F_Z, \Pr_Z)$ is defined over all possible outcomes of $K$ random bits for sample
point $Z$.

Note that all bad node events are mutually independent. From now on, we use $(\Omega_i, F_i, \Pr_i)$ to refer to
$(\Omega_{Z_i}, F_{Z_i}, \Pr_{Z_i})$ for the input $2N$ nodes, assuming a certain ordering.

Definition 10 (Bad Event $\mathcal{E}_1$) $\mathcal{E}_1$ is the same as $\mathcal{E}(Z_1) \cup \ldots \cup \mathcal{E}(Z_{2N})$ in the product probability space
$(\Omega, F, \Pr)$ composed of distinct probability spaces $(\Omega_1, F_1, \Pr_1), \ldots, (\Omega_{2N}, F_{2N}, \Pr_{2N})$ as in Definition 9.
Let $\bar{\mathcal{E}}_1$ denote the product probability space $(\Omega, F, \Pr)$ excluding $\mathcal{E}_1$.

Figure 3: Given Dots $\sim p_1$ and Triangles $\sim p_2$. Define $\mathrm{diff}(X) = E[c|X] - E[b|X]$ and $\mathrm{diff}(Y) = E[d|Y] -
E[a|Y]$. Given $K = \Omega(\ln N/\gamma)$, with high probability, $\mathrm{diff}(X) + \mathrm{diff}(Y) \ge K\gamma/2$, given that
$E_{X \sim D_1}[\mathrm{diff}(X)] + E_{Y \sim D_2}[\mathrm{diff}(Y)] = K\gamma$; hence $a + b < c + d$, with high probability, given also
that $KN = \Omega(\ln N \log\log N/\gamma^2)$.



For each balanced cut $(S, \bar{S})$, conditioned upon fixing a subset of random bits on all swapped nodes, as
shown in Figure 2, to behave nicely in the sense of Lemmas 16 and 17, we show that the conditional
expectations, in the sense of Definition 20, for random variables $\mathrm{diff}(T, (S, \bar{S}), L)$, $\forall L > 0$, are significantly
positive, so that the perfect partition can almost always win over all other balanced cuts, in terms of the
particular measure (minimum total score here), despite the large deviation events that we handle in Section 4.
This idea has been explored in the proof of Chaudhuri et al. (2007) for diploid scores.

The key difference between this score and the "diploid score" (see Chaudhuri et al., 2007, Section 2.1) is
that the corresponding diploid $\mathrm{diff}(Y)$ is always significantly positive in expectation, i.e., $E_{Y \sim D_m}[\mathrm{diff}(Y)] >
0$, $\forall m = 1, 2$, and thus remains so with high probability given $K = \Omega(\ln N/\gamma)$. That is, an individual
is almost always more similar to a randomly chosen peer from its population than to a randomly chosen
individual from another population, given a large enough $K$, based on "diploid scores". The cost of this nice
property is that two random bits from the same distribution are required at each dimension from every sample
point. In the present paper, we provide a similar positiveness guarantee for a pair of scores $\mathrm{diff}(X) + \mathrm{diff}(Y)$,
where $x \sim D_1$ and $y \sim D_2$, as illustrated in Figure 3. This property is due to Proposition 13 and Lemmas 16 and 17.
We would like to point out that the requirement on the input instance being balanced is due to the fact that
we need to pair up two individuals such that one comes from each distribution, in order to obtain the initial
expected minimality for $T$ as defined in Proposition 18.



3.2 The Expected Difference of Two Edges 

We first show that the perfect partition $T$ has the minimum value among all balanced cuts in expectation,
when summing up scores over all edges across the cut, in Proposition 18. The inspiration for using
an inner-product based score and pairing up $\mathrm{diff}(X)$ and $\mathrm{diff}(Y)$, for $X \sim D_1$ and $Y \sim D_2$, comes
from Freund and Mansour (1999). We first show that the sum of expected differences over $X \sim D_1$ and
$Y \sim D_2$ is significant.

Proposition 11 $\forall a, b = 1, 2$, $E_{x \sim D_a, y \sim D_b}[\langle x, y \rangle] = \langle p_a, p_b \rangle$.

Proof We have $\forall a, b = 1, 2$, $E_{x \sim D_a, y \sim D_b}[\langle x, y \rangle] = \sum_{i=1}^{K} E[x^i] E[y^i] = \langle p_a, p_b \rangle$, where the
first equality uses the independence of $x$ and $y$. ■

Proposition 12 Let $X$ be a sample point from $D_1$ and $Y$ be a point from $D_2$. Then $\mathrm{diff}(X) = \sum_{i=1}^{K} x^i (p_1^i - p_2^i)$,
and $\mathrm{diff}(Y) = \sum_{i=1}^{K} y^i (p_2^i - p_1^i)$.



Proposition 13 (Freund and Mansour, 1999) $E_{X \sim D_1}[\mathrm{diff}(X)] + E_{Y \sim D_2}[\mathrm{diff}(Y)] = \|p_1 - p_2\|_2^2 = K\gamma$.

Proof By Proposition 12, $E_{X \sim D_1}[\mathrm{diff}(X)] + E_{Y \sim D_2}[\mathrm{diff}(Y)] = \sum_{i=1}^{K} p_1^i (p_1^i - p_2^i) + \sum_{i=1}^{K} p_2^i (p_2^i - p_1^i) =
\langle p_1, p_1 - p_2 \rangle + \langle p_2, p_2 - p_1 \rangle = K\gamma$. ■

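
The identity in Proposition 13 is easy to confirm numerically. The sketch below uses the closed form $\mathrm{diff}(x) = \sum_i x^i(\text{own}^i - \text{other}^i)$ from Proposition 12, compares the exact expectation with $\|p_1 - p_2\|_2^2$, and also estimates it by sampling. The centers are random and purely illustrative, and the function names are hypothetical.

```python
import random

def diff_value(x, own, other):
    """Closed form of diff for a point x: sum_i x^i * (own^i - other^i)."""
    return sum(xi * (a - b) for xi, a, b in zip(x, own, other))

def expected_diff(own, other):
    """E[diff] when x is drawn from the product distribution with center `own`."""
    return sum(a * (a - b) for a, b in zip(own, other))

rng = random.Random(0)
K = 50
p1 = [rng.random() for _ in range(K)]
p2 = [rng.random() for _ in range(K)]

lhs = expected_diff(p1, p2) + expected_diff(p2, p1)
rhs = sum((a - b) ** 2 for a, b in zip(p1, p2))   # ||p1 - p2||_2^2 = K * gamma

# Monte Carlo estimate of the same quantity from sampled bit vectors.
trials = 20000
acc = 0.0
for _ in range(trials):
    x = [int(rng.random() < a) for a in p1]
    y = [int(rng.random() < b) for b in p2]
    acc += diff_value(x, p1, p2) + diff_value(y, p2, p1)
mc = acc / trials
```

Note that either summand alone, `expected_diff(p2, p1)` in particular, can be negative; only the paired sum is guaranteed to equal $K\gamma$.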
Before we proceed, we first state the following theorem and its corollary on Hoeffding Bounds. 



Theorem 14 (Hoeffding, 1963) If $X_1, X_2, \ldots, X_K$ are independent and $a_i \le X_i \le b_i$, $\forall i = 1, 2, \ldots, K$,
and if $\bar{X} = (X_1 + \ldots + X_K)/K$ and $\mu = E[\bar{X}]$, then for $t > 0$,
$\Pr[\bar{X} - \mu \ge t] \le e^{-2K^2t^2/\sum_{i=1}^{K}(b_i - a_i)^2}$.

Corollary 15 (Hoeffding, 1963) If $Y_1, \ldots, Y_m, Z_1, \ldots, Z_n$ are independent random variables with values
in the interval $[a, b]$, and if $\bar{Y} = (Y_1 + \ldots + Y_m)/m$, $\bar{Z} = (Z_1 + \ldots + Z_n)/n$, then for $t > 0$,

$$\Pr[\bar{Y} - \bar{Z} - (E[\bar{Y}] - E[\bar{Z}]) \ge t] \le e^{-2t^2/\left((m^{-1} + n^{-1})(b - a)^2\right)}.$$



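Let us denote w.l.o.g. $\eta = E_{X \sim D_1}[\mathrm{diff}(X)] \ge K\gamma/2$, and thus $E_{Y \sim D_2}[\mathrm{diff}(Y)] = K\gamma - \eta$, and show the
following two lemmas.

Lemma 16 Given that $K \ge \frac{8\ln(1/\tau)}{\gamma}$, $\Pr_X[\mathrm{diff}(X) < \eta - K\gamma/4] \le \tau$.

Proof Let us define $\gamma_k = (p_1^k - p_2^k)^2$, $\forall k = 1, \ldots, K$. Given that $x^1, \ldots, x^K$ are independent Bernoulli
random variables and $(p_1^k - p_2^k)x^k$ is either in $[0, \sqrt{\gamma_k}]$ or $[-\sqrt{\gamma_k}, 0]$, $\forall k = 1, \ldots, K$, we apply the Hoeffding
bound as in Theorem 14 with $t = K\gamma/4K = \gamma/4$:

$$\Pr_X\left[-\sum_{k=1}^{K}(p_1^k - p_2^k)x^k + \eta \ge K\gamma/4\right] = \Pr_X\left[\sum_{k=1}^{K}(p_1^k - p_2^k)x^k - \eta \le -K\gamma/4\right] \le e^{-2K^2(\gamma/4)^2/\sum_{k=1}^{K}(\sqrt{\gamma_k})^2} = e^{-K\gamma/8} \le \tau.$$

Thus we have that $\Pr_X\left[\sum_{k=1}^{K}(p_1^k - p_2^k)x^k \ge \eta - K\gamma/4\right] \ge 1 - \tau$. ■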
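
As a quick sanity check on this use of Hoeffding's bound: substituting $t = \gamma/4$ and $\sum_k (\sqrt{\gamma_k})^2 = K\gamma$ into Theorem 14 gives the tail bound $e^{-K\gamma/8}$, so a condition of the form $K \ge 8\ln(1/\tau)/\gamma$ suffices (the constant 8 is derived here from that substitution and is an assumption of this sketch). The code below estimates the bad node probability by simulation and compares it with $e^{-K\gamma/8}$; the centers and the function name `bad_node_frequency` are illustrative.

```python
import math
import random

def bad_node_frequency(p1, p2, trials, seed=0):
    """Empirical Pr[diff(X) < E[diff(X)] - K*gamma/4] for X drawn from center p1."""
    rng = random.Random(seed)
    k_gamma = sum((a - b) ** 2 for a, b in zip(p1, p2))
    eta = sum(a * (a - b) for a, b in zip(p1, p2))   # E[diff(X)]
    bad = 0
    for _ in range(trials):
        # diff(X) = sum over coordinates with x^k = 1 of (p1^k - p2^k).
        d = sum((a - b) for a, b in zip(p1, p2) if rng.random() < a)
        bad += d < eta - k_gamma / 4
    return bad / trials

K = 200
p1, p2 = [0.6] * K, [0.4] * K
k_gamma = sum((a - b) ** 2 for a, b in zip(p1, p2))      # equals K * gamma
hoeffding_bound = math.exp(-k_gamma / 8)
empirical = bad_node_frequency(p1, p2, trials=5000)
```

Here $K\gamma = 8$, so the bound is $e^{-1}$; the simulated frequency should come out well below it, which is expected since Hoeffding's inequality is not tight for this distribution.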
Lemma 17 Given that $K \ge \frac{8\ln(1/\tau)}{\gamma}$, $\Pr_Y[\mathrm{diff}(Y) < (K\gamma - \eta) - K\gamma/4] \le \tau$.

Proof Similar to the proof of Lemma 16, we have $\Pr_Y\left[\sum_{k=1}^{K}(p_2^k - p_1^k)y^k - (K\gamma - \eta) \le -K\gamma/4\right] \le \tau$, where
$K\gamma - \eta = E_{Y \sim D_2}[\mathrm{diff}(Y)]$. Hence $\Pr_Y\left[\sum_{k=1}^{K}(p_2^k - p_1^k)y^k \ge (K\gamma - \eta) - K\gamma/4\right] \ge 1 - \tau$. ■

In particular, combining (3) and Proposition 13 we have the following.

Proposition 18 $E[\mathrm{diff}(T, (S, \bar{S}), L)] = (N - L)LK\gamma$.

Proof (of Proposition 12) By Definition 5 we have

$$\mathrm{diff}(X) = E_{X' \sim D_1}[\mathrm{score}(X, X')] - E_{Y' \sim D_2}[\mathrm{score}(X, Y')] = E_{x' \sim p_1}[\langle x, x' \rangle] - E_{y' \sim p_2}[\langle x, y' \rangle] = \langle x, p_1 - p_2 \rangle = \sum_{i=1}^{K} x^i (p_1^i - p_2^i),$$

$$\mathrm{diff}(Y) = E_{Y' \sim D_2}[\mathrm{score}(Y, Y')] - E_{X' \sim D_1}[\mathrm{score}(Y, X')] = E_{y' \sim p_2}[\langle y, y' \rangle] - E_{x' \sim p_1}[\langle y, x' \rangle] = \langle y, p_2 - p_1 \rangle = \sum_{i=1}^{K} y^i (p_2^i - p_1^i). \ ■$$

Given such a positiveness guarantee on the conditional expectations of $\mathrm{diff}(T, (S, \bar{S}), L)$ described above,
the rest of the proof focuses on bounding large deviation events; a sketch of the key ideas has appeared
in Chaudhuri et al. (2007, Section 3), based on "diploid scores". We need to show that, with high probability,
all of the $O(2^{2N})$ random variables, in the form of $\mathrm{diff}(T, (S, \bar{S}), L)$, stay positive simultaneously, given
a sufficient number of features and total number of random bits. We describe the important ideas of this proof
in the next three sections, which contain key lemmas for each step; more proofs are contained in the appendix
for completeness of presentation.



4. Proof Techniques for Concentration 

We first introduce some notation regarding the sample probability space $(\Omega, F, \Pr)$. The set $\Omega$ is the set
of all possible outcomes for $2NK$ random bits, where we denote each bit as $b_j^k$ for a point $j$ at dimension
$k$. The $\sigma$-field $F$ of events is the set $\Sigma(\Omega)$ of all subsets of $\Omega$; and the probability measure $\Pr$ is based on
the product of probabilities of each random bit $b_j^k$, $\forall j, \forall k$, corresponding to Bernoulli($p_a^k$), where $a \in \{1, 2\}$
depends on the population of origin for individual $j$. Formally,

Definition 19 The elementary events in the underlying sample space $(\Omega, F, \Pr)$ are all possible $2^{2NK}$
choices of $D = 2NK$ bits. For $0 \le i \le D$ and $w \in \{0, 1\}^i$, let $B_w$ denote the event that the first $i$
bits equal the bit string $w$. Let $F_i$ be the $\sigma$-field generated by the partition of $\Omega$ into blocks $B_w$, for
$w \in \{0, 1\}^i$. Then the sequence $F_0, \ldots, F_D$ forms a filter. In the $\sigma$-field $F_i$, the only valid events are the ones
that depend on the values of the first $i$ bits, and all such events are valid within.

The events that we define next and their interactions are shown in Figure 5. We show that, with high
probability, all of the $O(2^{2N})$ random variables $\mathrm{diff}(T, (S, \bar{S}), L)$, as in (2), one corresponding to each
balanced $(S, \bar{S})$, are positive. We initially confine ourselves to a good subspace $\bar{\mathcal{E}}_1$ by excluding any
bad node event (Definition 9). This subspace has the nice property in the sense of Theorem 23. We then
use the union bound to bound the probability of any bad score event in this subspace, where a single bad score
event occurs when $\mathrm{diff}(T, (S, \bar{S}), L) < 0$ for a particular balanced $(S, \bar{S})$. We use the bounded differences
method to bound probabilities of such events.

Each time we examine $\mathrm{diff}(T, (S, \bar{S}), L)$ for a particular balanced $(S, \bar{S})$, we let the vector $(H_1, \ldots, H_{2KN})$
record the entire history of random bits, where $(H_1, \ldots, H_{2KL})$ record the partial history of bits on the $2L$
swapped nodes corresponding to $(S, \bar{S})$. Let $\ell = 2KL$ be a positive integer. We denote this $2KL$-history
with $H^{(\ell)}$. For a balanced $(S, \bar{S})$, let $\underline{h}$ be a fixed possible $\ell$-history: $\underline{h} = \{\underline{U}_1, \ldots, \underline{U}_L, \underline{V}_1, \ldots, \underline{V}_L\}$ denotes
a vector of $2KL$ random bits on $2L$ swapped nodes as shown in Figure 2, where $\underline{X}$ is the outcome of a
particular point $X$ in our sample. Let $\Omega_{\underline{h}}$ denote the event that we observe this particular $2KL$-history:
$\Omega_{\underline{h}} = \{\omega \in \Omega : H^{(\ell)}(\omega) = \underline{h}\}$. Given that $\Omega_{\underline{h}}$ occurs, we are concerned with the probability
space $(\Omega_{\underline{h}}, \Sigma(\Omega_{\underline{h}}), \Pr_{\underline{h}})$, for which we have the following definition and proposition.

Definition 20 $E_h[\mathrm{diff}(T, (S, \bar{S}), L)] = E[\mathrm{diff}(T, (S, \bar{S}), L) \mid F_{2KL}]$ is the expected value of $\mathrm{diff}(T, (S, \bar{S}), L)$
conditioned on an event $h \in F_{2KL}$. This conditional expectation $E[\mathrm{diff}(T, (S, \bar{S}), L) \mid F_{2KL}]$ is a random
variable that can be viewed as a function into $\mathbb{R}$ from the blocks in the partition of $F_{2KL}$.

Hence $E_{\underline{h}}[\mathrm{diff}(T, (S, \bar{S}), L)]$ is an evaluation at a particular outcome $\underline{h} \in F_{2KL}$.

Proposition 21 For a particular outcome $\underline{h} \in F_{2KL}$, $E_{\underline{h}}[\mathrm{diff}(T, (S, \bar{S}), L)] = (N - L)\sum_{j=1}^{L} \mathrm{diff}(\underline{U}_j) +
(N - L)\sum_{j=1}^{L} \mathrm{diff}(\underline{V}_j) = (N - L)\sum_{j=1}^{L}\sum_{k=1}^{K}(p_1^k - p_2^k)(u_j^k - v_j^k)$.

Our starting point for using the bounded differences method to bound a single bad score event over $(S, \bar{S})$ is
when we have revealed the $2KL$ bits and obtained a $2KL$-history $\underline{h}$ in $\bar{\mathcal{E}}_1$. Given a fixed history $\underline{h}$, we call
the remaining $2K(N - L)$ bits on unswapped nodes the $2K(N - L)$-future. Let $f = (H_{2KL+1}, \ldots, H_{2KN})$
be a fixed possible $2K(N - L)$-future. For simplicity of analysis, given $\underline{h}$, we first expand the confined
subspace $\bar{\mathcal{E}}_1$ by dropping constraints on the $2(N - L)$ unswapped nodes.

In this expanded subspace, we only require the first $2L$ swapped nodes to be good nodes, a condition
that we denote with $\bar{\mathcal{E}}_1^L(S, \bar{S})$, while leaving bits on the $2(N - L)$ unswapped nodes unconstrained; that is,
these nodes can be bad nodes. Thus $(\Omega_{\underline{h}}, \Sigma(\Omega_{\underline{h}}), \Pr_{\underline{h}})$ corresponds to the expanded subspace of $\bar{\mathcal{E}}_1$ given
$\underline{h}$, where we can apply the bounded differences method to analyze the probability of $\{\mathrm{diff}(T, (S, \bar{S}), L) < 0\}$
in a clean manner, applying Azuma's inequality as in Lemma 36. In fact, our starting point of the bounded
differences analysis is $E_{\underline{h}}[\mathrm{diff}(T, (S, \bar{S}), L)]$, where $\underline{h}$ is a fixed possible $2KL$-history on the $2L$ swapped
nodes for $(S, \bar{S})$, subject to $\underline{h} \in \bar{\mathcal{E}}_1^L(S, \bar{S})$:

Definition 22 $\mathcal{E}_1^L(S, \bar{S})$ is the same as $\mathcal{E}(U_1) \cup \ldots \cup \mathcal{E}(U_L) \cup \mathcal{E}(V_1) \cup \ldots \cup \mathcal{E}(V_L)$ in the product probability
space composed of distinct probability spaces defined over nodes $U_1, \ldots, U_L, V_1, \ldots, V_L$ as in Definition 9.

This immediately indicates that the conditional expected value $E_{\underline{h}}[\mathrm{diff}(T, (S, \bar{S}), L)] \ge (N - L)LK\gamma/2$,
which is our "advantageous base point" given that $\Omega_{\underline{h}}$ occurs. The proof of the following theorem appears
in Section 5.

Theorem 23 Given that all points are drawn from $\bar{\mathcal{E}}_1$, the probability space $(\Omega, F, \Pr)$ excluding $\mathcal{E}_1$, we
have, $\forall$ balanced $(S, \bar{S})$, where $\underline{h}$ is a particular $2KL$-history corresponding to the $2L$ swapped nodes
specified over $(S, \bar{S})$ with respect to $T$,

$$E_{\underline{h}}[\mathrm{diff}(T, (S, \bar{S}), L)] \ge (N - L)LK\gamma/2, \quad (4)$$

where the conditional expectation is over each of the individually expanded probability spaces $(\Omega_{\underline{h}}, \Sigma(\Omega_{\underline{h}}), \Pr_{\underline{h}})$
given $\underline{h} \in \bar{\mathcal{E}}_1^L$, where $\mathcal{E}_1^L$ is defined in Definition 22. This statement remains true after we require that
$\underline{h} \in \bar{\mathcal{E}}_2^L$ in addition, where $\mathcal{E}_2^L$ is defined in Definition 26.

Now as we reveal one by one the future 2K(N — L) random bits, the conditional expected values 
E/, diff(T, (S, S), L)\H} C) > 2KL form a martingale that is amenable to the bounded differences 
analysis as shown in Theorem [37] in Section [6] However, in order to obtain a concentration bound as tight 
as that in Theorem [37] we need to exclude one more event £ 2 L as in Definition [26] from the 2 K L-history 
h_, while examining a balanced (S, S). We first give some definitions regarding £ 2 L . Nodes are shown in 
Figure [2] 

Definition 24 Given vectors U\ , . . . , Ul and V\, . . . , vl, where u k -, v> k - are the k th bit of U/ and V/ respec- 
tively, f 2 k (h) = X - =l u) - XU 4 

Definition 25 (Deviation Values) \/k = 1, . . . , K, let t k *JT be the exact deviation on f 2 {h), i.e., f 2 (h) — 
E[f 2 k (h)]=t k VL,\/k. 

Definition 26 (Bad Deviation Event $\mathcal{E}_2^L$) In probability space $(\Omega, \mathcal{F}, \Pr)$, given a balanced $(S,\bar S)$ and its corresponding $2KL$-history $h$, $\mathcal{E}_2^L$ is the event such that the set of random variables $t_1, \ldots, t_K$ regarding the $2KL$ random bits recorded in $h$, as defined in Definition 25, are simultaneously large and satisfy $\sum_{k=1}^{K} t_k^2 \ge \Delta = 8N\ln 2 + 4K\ln 2\,(\log\log N + 1) + 3\ln N/2$.
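Definitions 24 to 26 can be made concrete with a small numerical sketch. The code below is illustrative only: the function names are hypothetical, it assumes (consistent with Proposition 40) that each $U_j$ is drawn with frequencies $p_1^k$ and each $V_j$ with frequencies $p_2^k$, and it assumes that $\log$ in $\Delta$ is base 2, which the text does not state explicitly.

```python
import math

def deviation_values(U, V, p1, p2):
    """Return t_1..t_K for swapped nodes U, V (each a list of L bit vectors of length K).

    Assumption: U_j has frequencies p1 and V_j has frequencies p2.
    """
    L, K = len(U), len(p1)
    ts = []
    for k in range(K):
        f2k = sum(U[j][k] - V[j][k] for j in range(L))  # Definition 24
        expected = L * (p1[k] - p2[k])                  # E[f_2^k(h)]
        ts.append((f2k - expected) / math.sqrt(L))      # Definition 25
    return ts

def delta_threshold(N, K):
    """Delta from Definition 26; log taken base 2 (an assumption)."""
    loglog = math.log2(math.log2(N))
    return 8 * N * math.log(2) + 4 * K * math.log(2) * (loglog + 1) + 1.5 * math.log(N)

def is_bad_deviation(ts, N):
    """True when the history falls in the bad deviation event of Definition 26."""
    return sum(t * t for t in ts) >= delta_threshold(N, len(ts))

# Tiny example with L = 2 swapped pairs and K = 2 dimensions.
U = [[1, 0], [1, 1]]
V = [[0, 0], [0, 1]]
ts = deviation_values(U, V, [0.5, 0.5], [0.5, 0.5])
print(ts, is_bad_deviation(ts, N=4))
```

In this toy instance the squared deviations are far below $\Delta$, so the history is not in the bad event.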






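Using Definitions 26 and 25, we immediately have the following lemma.

Lemma 27 Given that $h \in \bar{\mathcal{E}}_2^L$, we have $\forall k$,

$$|f_2^k(h)| \le |\mathbf{E}[f_2^k(h)]| + |t_k\sqrt{L}|,$$

and $\sum_{k=1}^{K} t_k^2 < \Delta$, where $t_k$ is as in Definition 25 and $\mathcal{E}_2^L$ is as in Definition 26.

Proof By definition of $t_k$, $\forall k$, we have that $f_2^k(h) = \mathbf{E}[f_2^k(h)] + t_k\sqrt{L}$, where $t_k \in \left[\frac{-L - \mathbf{E}[f_2^k(h)]}{\sqrt{L}}, \frac{L - \mathbf{E}[f_2^k(h)]}{\sqrt{L}}\right]$. Thus the lemma holds given that $h \in \bar{\mathcal{E}}_2^L$. ■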

Excluding $\mathcal{E}_2^L$ from $h$ is crucial in bounding the difference that each of the $2(N-L)K$ future random bits causes when we work in probability space $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$, where the difference refers to

$$\left|\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}] - \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell'-1)}]\right|,$$

where $2KN \ge \ell' > 2KL$ depends on the bit, such that the square sum of all these differences is not too big, as in Lemma 27. This is illustrated in the second graph in Figure 2. This allows us to bound the probability of a bad score event, i.e., $\mathrm{diff}(T,(S,\bar S),L) < 0$, using Azuma's inequality in probability space $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$ as in Section 6. The proof of the following lemma is rather long and is shown in Section A.1.



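Lemma 28 Let $h$ be the specific $2KL$-history that we record for a balanced cut $(S,\bar S)$ such that $h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L$. Let $p_3 = \frac{2}{N^{4L}}$. Then for $K = \Omega\left(\frac{\ln N}{\gamma}\right)$ and $KN = \Omega\left(\frac{\ln N \log\log N}{\gamma^2}\right)$, for all $N \ge 4$,

$$\Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L,\ f \text{ at random}] \le p_3.$$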

Eventually we compute the probability of events $\{\mathrm{diff}(T,(S,\bar S),L) < 0\}$ in $\bar{\mathcal{E}}_1^N$ for all balanced $(S,\bar S)$ in Section 7.

5. Proof of Theorem [23] 

This section is dedicated to proving Theorem 23. We first give another definition.

Definition 29 $\mathcal{E}_1^{N-L}(S,\bar S)$ is the same as $\mathcal{E}(X_1) \cup \ldots \cup \mathcal{E}(X_{N-L}) \cup \mathcal{E}(Y_1) \cup \ldots \cup \mathcal{E}(Y_{N-L})$ in the product probability space composed of distinct probability spaces defined over nodes $X_1, \ldots, X_{N-L}$ and $Y_1, \ldots, Y_{N-L}$ as in Definition 9.

Hence $\bar{\mathcal{E}}_1^L$ and $\bar{\mathcal{E}}_1^{N-L}$ imply that no bad node event happens in the appropriate product spaces thus defined. We omit $(S,\bar S)$ from $\mathcal{E}_1^L(S,\bar S)$ and $\mathcal{E}_1^{N-L}(S,\bar S)$ when it is clear from the context. Given a balanced cut $(S,\bar S)$, $h$ records a history on the $2KL$ bits on swapped nodes $U_1, \ldots, U_L, V_1, \ldots, V_L$.

Proposition 30 Given that all nodes are drawn from $\bar{\mathcal{E}}_1^N$, for any balanced cut $(S,\bar S)$, the particular $2KL$-history $h$ that we record must satisfy $h \in \bar{\mathcal{E}}_1^L(S,\bar S)$.

Proof Given $\bar{\mathcal{E}}_1^N$, we know that for all nodes $Z_1, \ldots, Z_{2N}$,

$$\mathrm{diff}(Z_i) \ge \mathbf{E}[\mathrm{diff}(Z_i)] - K\gamma/4, \tag{5}$$

simultaneously in the product probability space $(\Omega, \mathcal{F}, \Pr)$, where $\mathrm{diff}(Z_i)$ is a random variable solely determined by node $Z_i$'s bit vector. In particular, for each balanced $(S,\bar S)$, we focus on the product probability space that is composed of distinct probability spaces defined over swapped nodes $U_1, \ldots, U_L, V_1, \ldots, V_L$ as in Definition 22. After we reveal these $2L$ bit vectors on $U_j, V_j$, $\forall j = 1, \ldots, L$, by (5),

$$\mathrm{diff}(U_j) \ge \mathbf{E}[\mathrm{diff}(U_j)] - K\gamma/4, \quad \forall j = 1, \ldots, L, \tag{6}$$
$$\mathrm{diff}(V_j) \ge \mathbf{E}[\mathrm{diff}(V_j)] - K\gamma/4, \quad \forall j = 1, \ldots, L. \tag{7}$$

Thus we have $h \in \bar{\mathcal{E}}_1^L(S,\bar S)$. ■

Definition 31 We use $f$ to denote the future of the $2(N-L)K$ random bits that we are going to reveal for the unswapped nodes on a given balanced cut $(S,\bar S)$. Recall that once we are fixed to the probability space such that $\mathcal{E}_1^N$ does not happen, we know that both $h$ and $f$ are confined; the following two notations are equivalent:

$$(h \in \bar{\mathcal{E}}_1^L(S,\bar S)) \cap (f \in \bar{\mathcal{E}}_1^{N-L}(S,\bar S)),$$
$$(h, f) \in \bar{\mathcal{E}}_1^N.$$

Remark 32 Another way of seeing $\bar{\mathcal{E}}_1^L(S,\bar S)$ (with respect to a particular balanced cut $(S,\bar S)$) is to view it as an event in the simple probability space $(\Omega, \mathcal{F}, \Pr)$, such that we put constraints only on the specific $2L$ swapped nodes defined on $(S,\bar S)$ while leaving $f$ at random. Hence we have $\bar{\mathcal{E}}_1^N \subset \bar{\mathcal{E}}_1^L(S,\bar S)$ in $(\Omega, \mathcal{F}, \Pr)$.

We leave this confined space given $\bar{\mathcal{E}}_1^N$ for now and explore the following expanded subspace, where we require $h \in \bar{\mathcal{E}}_1^L$ while leaving the future $f$ at random. $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$ corresponds to this expanded subspace, where $h \in \bar{\mathcal{E}}_1^L$. This immediately implies the following lemma.

Lemma 33 For a balanced cut $(S,\bar S)$, given a particular $2KL$-history $h \in \{0,1\}^{2KL}$ on the $2L$ swapped nodes such that $h \in \bar{\mathcal{E}}_1^L$,

$$\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid h \in \bar{\mathcal{E}}_1^L,\ f \text{ at random}] \ge L(N-L)K\gamma/2, \tag{8}$$

where the expectation is over all possible outcomes of the $2(N-L)K$ random bits in $f$ in probability space $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$.

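Proof For a balanced cut $(S,\bar S)$, given $h \in \bar{\mathcal{E}}_1^L$, where $h$ records $2KL$ bits over swapped nodes $U_j, V_j$, $\forall j = 1, \ldots, L$, by Definition 22,

$$\mathrm{diff}(U_j) \ge \mathbf{E}[\mathrm{diff}(U_j)] - K\gamma/4, \quad \forall j = 1, \ldots, L, \tag{9}$$
$$\mathrm{diff}(V_j) \ge \mathbf{E}[\mathrm{diff}(V_j)] - K\gamma/4, \quad \forall j = 1, \ldots, L, \tag{10}$$

and hence $\mathrm{diff}(U_j) + \mathrm{diff}(V_j) \ge K\gamma/2$, $\forall j = 1, \ldots, L$, by Proposition [B]. Thus, in $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$, where $f$ is at random and $h \in \bar{\mathcal{E}}_1^L$, we have from Proposition EU

$$\begin{aligned}
\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)] &= (N-L)\sum_{j=1}^{L} \mathrm{diff}(U_j) + (N-L)\sum_{j=1}^{L} \mathrm{diff}(V_j) \\
&\ge (N-L)\sum_{j=1}^{L} \left(\mathrm{diff}(U_j) + \mathrm{diff}(V_j)\right) \ge (N-L)LK\gamma/2.
\end{aligned}$$
■

Recall that $\bar{\mathcal{E}}_2^L$ is the event that no simultaneously large deviation happens across $2L$ individuals over their $2KL$ random bits.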

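Corollary 34 Given that $h \in \bar{\mathcal{E}}_1^L \cap \bar{\mathcal{E}}_2^L$ and $f$ is at random:

$$\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid h \in \bar{\mathcal{E}}_1^L \cap \bar{\mathcal{E}}_2^L,\ f \text{ at random}] \ge L(N-L)K\gamma/2, \tag{11}$$

which holds so long as $h \in \bar{\mathcal{E}}_1^L$.

We next bound $\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)]$ for all balanced $(S,\bar S)$, where $h$ is confined in $\bar{\mathcal{E}}_1^N$ and $\bar{\mathcal{E}}_2^L$. We now prove Theorem 23.

Proof of Theorem 23. By Proposition 30, for each balanced cut $(S,\bar S)$, we have

$$h \in \bar{\mathcal{E}}_1^L(S,\bar S). \tag{12}$$

Now apply Corollary 34; given that $h \in \bar{\mathcal{E}}_1^L(S,\bar S) \cap \bar{\mathcal{E}}_2^L$, we immediately have the theorem. ■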

Remark 35 $\mathrm{diff}(Z)$ is determined by node $Z$'s bit pattern, which is the same when we observe it from every balanced cut where it acts as a swapped node. Hence although we do have $O(2^{2N})$ balanced cuts, $\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)]$ for all balanced cuts are just determined by the $2N$ random variables $\mathrm{diff}(Z_1), \ldots, \mathrm{diff}(Z_{2N})$, each of which is determined by the bit vector of an individual in our sample.

6. Bounded Differences 

In order to show Lemma 28 (for the actual proof, see Section A.1), we prove Theorem 37 in this section, where we bound the deviation of random variable $\mathrm{diff}(T,(S,\bar S),L)$ for a particular balanced cut $(S,\bar S)$. Recall that we let bit vector $(H_1, \ldots, H_{2KN})$ record the entire history of random bits that we see, where $(H_1, \ldots, H_{2KL})$ record the $2KL$-history $H^{(2KL)}$ on $2L$ swapped nodes. First it is convenient to introduce some more notation: for $\ell' > 2KL$, we begin to reveal the random bits on unswapped nodes in $(S,\bar S)$. The random variable $\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}]$ depends on the random extension $H^{(\ell')}$ of $h$ observed. By definition

$$\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}](\omega) = \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')} = h'] \quad \text{for } \omega \in \Omega_h, \text{ where } h' = H^{(\ell')}(\omega);$$

another notation for this is $\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid \mathcal{F}]$, where $\mathcal{F}$ is the $\sigma$-field generated by $H^{(\ell')}$ restricted to $\Omega_h$. To prove the theorem, we introduce the following.

Lemma 36 (Azuma's Inequality) Let $Z_0, Z_1, \ldots, Z_m = f$ be a martingale on some probability space, and suppose that $|Z_i - Z_{i-1}| \le c_i$, $\forall i = 1, 2, \ldots, m$; then

$$\Pr[|f - \mathbf{E}[f]| \ge t] \le 2e^{-t^2/2\sigma^2},$$

where $\sigma^2 = \sum_{i=1}^{m} c_i^2$.
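As a quick illustration of how the bound in Lemma 36 is evaluated, the following minimal sketch computes $2e^{-t^2/2\sigma^2}$ from a list of difference bounds $c_i$. The function name is hypothetical and the result is capped at 1, since any probability is trivially at most 1.

```python
import math

def azuma_two_sided_bound(c, t):
    """Evaluate 2*exp(-t^2 / (2*sigma^2)) with sigma^2 = sum of c_i^2, capped at 1."""
    sigma_sq = sum(ci * ci for ci in c)
    if sigma_sq == 0:
        # A martingale with zero differences never moves from its mean.
        return 0.0 if t > 0 else 1.0
    return min(1.0, 2.0 * math.exp(-t * t / (2.0 * sigma_sq)))

# 100 steps, each bounded by 1, deviation t = 20: sigma^2 = 100.
bound = azuma_two_sided_bound([1.0] * 100, 20.0)
print(bound)
```

The bound decays quickly once $t$ exceeds a few multiples of $\sigma$, which is why the proof below concentrates on keeping $\sigma^2$ small.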

We are now ready to use the bounded differences approach in $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$ and prove Theorem 37.

Theorem 37 Let $h$ be a possible $2KL$-history that we record for a balanced cut $(S,\bar S)$ such that $h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L$. Then, for $t > 0$, in probability space $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$, where all future $2(N-L)K$ random bits $f$ are completely at random,

$$\Pr_h\left[\left|\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{2KN}] - \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)]\right| \ge t\right] \le 2e^{-t^2/2\sigma^2},$$

where $\sigma^2 \le 4(N-L)L^2(K\gamma) + 4(N-L)L\Delta$, for all balanced $(S,\bar S)$ with $0 < L \le N/2$ swapped nodes.
Proof We shall set things up to use Lemma 36. We work in probability space $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$. We start to reveal the $2K(N-L)$ bits on unswapped nodes that are chosen independently at random, and rely on the $2L$ swapped nodes having a good history $h$, given that $h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L$.

Given the $\sigma$-field $(\Omega_h, \Sigma(\Omega_h))$, with $\Sigma(\Omega_h) = 2^{\Omega_h}$, let us first define a filter $\mathbb{F}$. Given independent random bits $H_{2KL+1}, \ldots, H_{2KN}$, the filter is defined by letting $\mathcal{F}_i$, $\forall i = 1, \ldots, m$, where $m = 2K(N-L)$, be the $\sigma$-field generated by histories $H^{(2KL+1)}, \ldots, H^{(2KL+i)}$. We thus obtain a natural filter $\mathbb{F}$:

$$\{\emptyset, \Omega_h\} = \mathcal{F}_0 \subset \mathcal{F}_1 \subset \ldots \subset \mathcal{F}_m = 2^{\Omega_h},$$

where for $0 \le i \le m = 2K(N-L)$, $(\Omega_h, \mathcal{F}_i)$ is a $\sigma$-field. Hence $\mathbb{F}$ corresponds to the increasingly refined partitions of $\Omega_h$ obtained from all the different possible extensions of the $2KL$-history $h$.

We obtain a martingale for random variable $\mathrm{diff}(T,(S,\bar S),L)$ as follows. Let $Z_0 = \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)]$ and

$$Z_{\ell'-2KL} = \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}] = \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid \mathcal{F}_{\ell'-2KL}], \tag{13}$$

where $\mathcal{F}_{\ell'-2KL}$ is the $\sigma$-field generated by $H^{(\ell')}$ restricted to $\Omega_h$ and $2KN \ge \ell' > 2KL$. Let $H_{2KL+1}, \ldots, H_{2KN}$ map to random bits on $x_1^1, \ldots, x_{N-L}^K, y_1^1, \ldots, y_{N-L}^K$, where $x_i^k$ or $y_i^k$ refers to a single bit on dimension $k$ of individual $X_i$ or $Y_i$ respectively. We first define the following, $\forall j = 1, 2, \ldots, m$, where $m = 2K(N-L)$,

$$|Z_j - Z_{j-1}| = c_j. \tag{14}$$

We also need to translate between $c_j$, where $j = 1, 2, \ldots, m$, and $d_{i,k}(X_i)$ and $d_{i,k}(Y_i)$, $\forall i = 1, \ldots, N-L$, $k = 1, \ldots, K$, that correspond to the bit on dimension $k$ of $X_i$ and $Y_i$ respectively. In particular, $\forall i$, $\forall k$, we let

$$c_{(i-1)K+k} = d_{i,k}(X_i), \tag{15}$$

$$c_{(N-L+i-1)K+k} = d_{i,k}(Y_i). \tag{16}$$

Let $\ell' = 2KL + (i-1)K + k - 1$; we have

$$d_{i,k}(X_i) = \left|\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}, x_i^k] - \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}]\right|. \tag{17}$$

And similarly, let $\ell' = 2KL + (N-L)K + (i-1)K + k - 1$; we have

$$d_{i,k}(Y_i) = \left|\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}, y_i^k] - \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{(\ell')}]\right|.$$

Figure 4: Set of edges that random bits on $Y_1$ influence upon

We immediately have the following lemma that we can plug into Azuma's inequality, where $d_{i,k}$ applies to both $d_{i,k}(X_i)$ and $d_{i,k}(Y_i)$.

Lemma 38 For the $2(N-L)K$ random bits on unswapped nodes $X_i, Y_i$, $\forall i \in [1, N-L]$, that we reveal, at dimension $k \in [1, K]$ we have

$$d_{i,k} \le |L(p_1^k - p_2^k)| + |t_k\sqrt{L}|,$$

where $t_k$ is defined in Definition 25 and $\Delta$ as in Definition 26, and $\sum_{k=1}^{K} t_k^2 \le \Delta$.

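Proof Given that $Y_i$, $\forall i$, comes from $D_2$ and $X_j$, $\forall j$, comes from $D_1$, and by definition of $d_{i,k}(Y_i)$ and $d_{i,k}(X_i)$,

$$d_{i,k}(Y_i) = \begin{cases} p_2^k\,|f_2^k(h)| & y_i^k = 0, \\ (1 - p_2^k)\,|f_2^k(h)| & y_i^k = 1, \end{cases}
\qquad \text{and} \qquad
d_{i,k}(X_i) = \begin{cases} p_1^k\,|f_2^k(h)| & x_i^k = 0, \\ (1 - p_1^k)\,|f_2^k(h)| & x_i^k = 1. \end{cases} \tag{18}$$

Hence given that $h \in \bar{\mathcal{E}}_2^L$, Lemma 27, and $|\mathbf{E}[f_2^k(h)]| = |L(p_1^k - p_2^k)|$ as in Proposition 40,

$$d_{i,k}(Y_i) \le |f_2^k(h)| \le |\mathbf{E}[f_2^k(h)]| + |t_k\sqrt{L}| = |L(p_2^k - p_1^k)| + |t_k\sqrt{L}|,$$

and similarly, $d_{i,k}(X_i) \le |L(p_2^k - p_1^k)| + |t_k\sqrt{L}|$, where $\sum_{k=1}^{K} t_k^2 \le \Delta$. ■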

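We are now ready to obtain a bound for $\sigma^2 = 2\sum_{i=1}^{N-L}\sum_{k=1}^{K} d_{i,k}^2$, where $d_{i,k}^2 \le \left(|L(p_2^k - p_1^k)| + |\sqrt{L}\,t_k|\right)^2$ applies to unswapped nodes $X_i, Y_i$, $\forall i = 1, \ldots, N-L$, in bounding the differences they cause by revealing the random bits on dimension $k$.

Given that $\sum_{k=1}^{K} t_k^2 \le \Delta$,

$$\begin{aligned}
\sigma^2 &= \sum_{i=1}^{N-L}\sum_{k=1}^{K}\left(d_{i,k}(X_i)^2 + d_{i,k}(Y_i)^2\right) = 2\sum_{i=1}^{N-L}\sum_{k=1}^{K} d_{i,k}^2 \le 2\sum_{i=1}^{N-L}\sum_{k=1}^{K}\left(|L(p_2^k - p_1^k)| + |\sqrt{L}\,t_k|\right)^2 \\
&\le 2(N-L)\sum_{k=1}^{K}\left[2\left(L(p_2^k - p_1^k)\right)^2 + 2\left(\sqrt{L}\,t_k\right)^2\right] \\
&= 4L^2(N-L)\sum_{k=1}^{K}(p_2^k - p_1^k)^2 + 4L(N-L)\sum_{k=1}^{K} t_k^2 \\
&\le 4(N-L)L^2(K\gamma) + 4(N-L)L\Delta,
\end{aligned}$$

where $\Delta = 8N\ln 2 + 4K\ln 2\,(\log\log N + 1) + 3\ln N/2$ as in Definition 26.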
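The chain of inequalities above can be sanity-checked numerically. The sketch below is illustrative only, with hypothetical function names; it assumes $\gamma$ is the average of $(p_1^k - p_2^k)^2$ over the $K$ dimensions (so that $\sum_k (p_2^k - p_1^k)^2 = K\gamma$, as the last step of the derivation uses) and it plugs in $\sum_k t_k^2$ in place of $\Delta$.

```python
import math

def sigma_sq_from_differences(N, L, p1, p2, ts):
    """2 * sum_i sum_k (|L(p2-p1)| + |sqrt(L) t_k|)^2 over the N-L unswapped pairs."""
    total = 0.0
    for k in range(len(p1)):
        d = abs(L * (p2[k] - p1[k])) + abs(math.sqrt(L) * ts[k])
        total += 2 * (N - L) * d * d
    return total

def sigma_sq_bound(N, L, K, gamma, delta):
    """Closed-form bound 4(N-L)L^2(K*gamma) + 4(N-L)L*delta."""
    return 4 * (N - L) * L * L * (K * gamma) + 4 * (N - L) * L * delta

N, L = 10, 3
p1, p2 = [0.6, 0.5], [0.4, 0.5]
ts = [1.0, -2.0]
K = len(p1)
gamma = sum((a - b) ** 2 for a, b in zip(p1, p2)) / K  # assumed meaning of gamma
delta = sum(t * t for t in ts)                          # stands in for Delta

lhs = sigma_sq_from_differences(N, L, p1, p2, ts)
rhs = sigma_sq_bound(N, L, K, gamma, delta)
print(lhs, rhs)
```

The gap between the two values comes from the step $(a+b)^2 \le 2a^2 + 2b^2$.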
7. Putting Things Together

First, there are two lemmas regarding these events. We want to emphasize that we exclude $\mathcal{E}_1^N$ once for all $2N$ nodes, while excluding one $\mathcal{E}_2^L$ from each balanced cut $(S,\bar S)$, where $L$ denotes that the event $\mathcal{E}_2^L$ is defined over the particular set of $2KL$ bits across $K$ dimensions on the $2L$ swapped nodes in $(S,\bar S)$; we have $\binom{N}{L}^2$ such events for each $L$, whose probabilities we sum up later using the union bound.
Lemma 39 Let K > l^ull, j„ probability space (Q, F, Pr), Pr[£ f] <pi = jpx. 
Proof Apply Lemma[[6]to each diff(Z) with r = 1 /N 32 ; Given K > ^f^, we have VZ, 

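$$\Pr_Z[\mathcal{E}(Z)] \le \frac{1}{N^{32}}.$$

We adopt the view of composing the product space $(\Omega, \mathcal{F}, \Pr)$ through distinct probability spaces $(\Omega_1, \mathcal{F}_1, \Pr_1), \ldots, (\Omega_{2N}, \mathcal{F}_{2N}, \Pr_{2N})$ as in Definition ffOl, where $(\Omega_i, \mathcal{F}_i, \Pr_i)$, $\forall i$, is defined over all possible outcomes for the $K$ random bits of individual $Z_i$. Therefore by definition, event $\bar{\mathcal{E}}_1^N$ is the same as the joint event $\bar{\mathcal{E}}(Z_1) \cap \ldots \cap \bar{\mathcal{E}}(Z_{2N})$ in $(\Omega, \mathcal{F}, \Pr)$.

$$\begin{aligned}
\Pr[\bar{\mathcal{E}}_1^N] &= \Pr[\text{none of } \mathcal{E}(Z) \text{ happens, for all nodes } Z] && (19) \\
&= \Pr[\bar{\mathcal{E}}(Z_1) \cap \bar{\mathcal{E}}(Z_2) \cap \ldots \cap \bar{\mathcal{E}}(Z_{2N})] && (20) \\
&= \Pr_1[\bar{\mathcal{E}}(Z_1)] \cdot \Pr_2[\bar{\mathcal{E}}(Z_2)] \cdot \ldots \cdot \Pr_{2N}[\bar{\mathcal{E}}(Z_{2N})] && (21) \\
&= (1 - \Pr_1[\mathcal{E}(Z_1)]) \cdot (1 - \Pr_2[\mathcal{E}(Z_2)]) \cdot \ldots \cdot (1 - \Pr_{2N}[\mathcal{E}(Z_{2N})]) \\
&\ge \left(1 - \frac{1}{N^{32}}\right)^{2N} \ge 1 - \frac{2}{N^{31}}. && (22)
\end{aligned}$$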
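The last step in (22) is an instance of Bernoulli's inequality, $(1-\tau)^{m} \ge 1 - m\tau$. A minimal sketch with exact rational arithmetic is given below; the function name is hypothetical, and it takes the worst case in which every node's bad event has probability exactly $\tau = 1/N^{32}$.

```python
from fractions import Fraction

def no_bad_node_bounds(N):
    """Return ((1 - tau)^(2N), 1 - 2N*tau) for tau = 1/N^32, as exact fractions."""
    tau = Fraction(1, N ** 32)
    product = (1 - tau) ** (2 * N)   # worst-case product over 2N independent nodes
    linear = 1 - 2 * N * tau         # equals 1 - 2/N^31
    return product, linear

for N in (2, 4, 16):
    product, linear = no_bad_node_bounds(N)
    print(N, product >= linear, linear == 1 - Fraction(2, N ** 31))
```

Exact fractions are used because $1/N^{32}$ is far below floating-point resolution relative to 1 for even moderate $N$.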



Before we prove Lemma 42, let us first obtain the expected value of $f_2^k(h)$, as in Definition 24.

Proposition 40 $\mathbf{E}[f_2^k(h)] = \mathbf{E}\left[\sum_{j=1}^{L} (u_j^k - v_j^k)\right] = L(p_1^k - p_2^k)$.

Next we examine the deviation for each random variable $f_2^k(h)$, $\forall k$.

Lemma 41 $\forall k$, for random variable $f_2^k(h)$ as in Definition 24,

$$\Pr\left[\left|f_2^k(h) - \mathbf{E}[f_2^k(h)]\right| \ge t_k\sqrt{L}\right] \le 2e^{-t_k^2}. \tag{23}$$

In addition, events corresponding to different dimensions are independent.

Proof Let us define random variables $U^k, V^k$ such that

$$f_2^k(h) = L(U^k - V^k), \tag{24}$$

where $U^k = \sum_{j=1}^{L} u_j^k/L$ and $V^k = \sum_{j=1}^{L} v_j^k/L$. Thus by Proposition 40,

$$\mathbf{E}[U^k] - \mathbf{E}[V^k] = \frac{1}{L}\mathbf{E}[f_2^k(h)] = p_1^k - p_2^k.$$

Now applying Corollary 15 of Theorem 14 to bound the probability of deviations on both sides of the expected differences, let $t = t_k\sqrt{L}/L$; we have

$$\Pr\left[\left|f_2^k(h) - \mathbf{E}[f_2^k(h)]\right| \ge t_k\sqrt{L}\right] = \Pr\left[\left|U^k - V^k - (\mathbf{E}[U^k] - \mathbf{E}[V^k])\right| \ge t_k\sqrt{L}/L\right] \le 2e^{-2(t_k\sqrt{L}/L)^2 L/2} \le 2e^{-t_k^2}.$$
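The tail bound (23) can be compared against simulation. The sketch below is a Monte Carlo illustration only, not part of the proof; the function names, the seed, and the parameter values are arbitrary choices, and it assumes the bits $u_j^k$ and $v_j^k$ are independent Bernoulli draws with means $p_1^k$ and $p_2^k$.

```python
import math
import random

def empirical_tail(L, p1, p2, t, trials, seed=0):
    """Fraction of trials in which |f - E[f]| >= t*sqrt(L) for one dimension k."""
    rng = random.Random(seed)
    mean = L * (p1 - p2)
    hits = 0
    for _ in range(trials):
        # f = sum over the L swapped pairs of (u_j - v_j)
        f = sum((rng.random() < p1) - (rng.random() < p2) for _ in range(L))
        if abs(f - mean) >= t * math.sqrt(L):
            hits += 1
    return hits / trials

def lemma41_bound(t):
    """Right-hand side of (23)."""
    return 2 * math.exp(-t * t)

freq = empirical_tail(L=50, p1=0.6, p2=0.4, t=1.5, trials=2000)
print(freq, lemma41_bound(1.5))
```

With these parameters the empirical frequency sits well below the bound, which is expected because the bound does not use the actual variance of the bits.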



The following two lemmas show that $\{h \in \mathcal{E}_2^L\}$ remains exponentially small whether or not $\bar{\mathcal{E}}_1^N$ is given. A variant of the following lemma has been used in the full proof for Chaudhuri et al. (2007, Theorem 3.1). It is included in Section A for completeness.

Lemma 42 (Chaudhuri et al., 2007) In probability space $(\Omega, \mathcal{F}, \Pr)$, for each balanced cut $(S,\bar S)$, $\Pr[h \in \mathcal{E}_2^L] \le p_2$, where $p_2 = O\left(\frac{1}{2^{2N}\,\mathrm{poly}(N)}\right)$ and $N \ge 2$.



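Lemma 43 $\Pr[h \in \mathcal{E}_2^L \mid \bar{\mathcal{E}}_1^N] = \Pr[h \in \mathcal{E}_2^L \mid h \in \bar{\mathcal{E}}_1^L] \le \frac{p_2}{1 - 2L/N^{32}}$.

Proof Given the following equations:

$$\Pr[h \in \mathcal{E}_2^L] = \Pr[h \in \mathcal{E}_2^L \mid h \in \bar{\mathcal{E}}_1^L] \cdot \Pr[h \in \bar{\mathcal{E}}_1^L] + \Pr[h \in \mathcal{E}_2^L \mid h \in \mathcal{E}_1^L] \cdot \Pr[h \in \mathcal{E}_1^L], \tag{25}$$

$$\Pr[h \in \bar{\mathcal{E}}_1^L] \ge \left(1 - \frac{1}{N^{32}}\right)^{2L} \ge 1 - 2L/N^{32}, \tag{26}$$

we have:

$$\begin{aligned}
\Pr[h \in \mathcal{E}_2^L \mid h \in \bar{\mathcal{E}}_1^L] &= \frac{\Pr[h \in \mathcal{E}_2^L] - \Pr[h \in \mathcal{E}_2^L \mid h \in \mathcal{E}_1^L] \cdot \Pr[h \in \mathcal{E}_1^L]}{\Pr[h \in \bar{\mathcal{E}}_1^L]} \\
&\le \frac{\Pr[h \in \mathcal{E}_2^L]}{\Pr[h \in \bar{\mathcal{E}}_1^L]} \le \frac{p_2}{1 - 2L/N^{32}}.
\end{aligned} \tag{27}$$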



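Lemma 44 shows that $\Pr_h[\mathrm{diff}(T,(S,\bar S),L) < 0]$ remains small regardless of whether $f$ stays in the confined subspace $\bar{\mathcal{E}}_1^{N-L}$ or is entirely at random as in $(\Omega_h, \Sigma(\Omega_h), \Pr_h)$.

Lemma 44 $\Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid (h,f) \in \bar{\mathcal{E}}_1^N \cap h \in \bar{\mathcal{E}}_2^L] \le \frac{p_3}{1 - 2(N-L)/N^{32}}$.

Proof We use $e_0$ to replace $\{\mathrm{diff}(T,(S,\bar S),L) < 0\}$ and bound the following:

$$\Pr[e_0 \mid (h \in \bar{\mathcal{E}}_1^L \cap \bar{\mathcal{E}}_2^L) \cap f \in \bar{\mathcal{E}}_1^{N-L}],$$

which is the same as the term in the statement of the lemma.

$$\begin{aligned}
\Pr[e_0 \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L,\ f \text{ at random}] ={}& \Pr[e_0 \mid (h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L) \cap f \in \bar{\mathcal{E}}_1^{N-L}] \cdot \Pr[f \in \bar{\mathcal{E}}_1^{N-L} \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L] \\
&+ \Pr[e_0 \mid (h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L) \cap f \in \mathcal{E}_1^{N-L}] \cdot \Pr[f \in \mathcal{E}_1^{N-L} \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L].
\end{aligned}$$

By independence between node events:

$$\Pr[f \in \bar{\mathcal{E}}_1^{N-L} \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L] = \Pr[f \in \bar{\mathcal{E}}_1^{N-L}], \tag{28}$$
$$\Pr[f \in \mathcal{E}_1^{N-L} \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L] = \Pr[f \in \mathcal{E}_1^{N-L}]. \tag{29}$$

Given that events $\mathcal{E}_2^L, \mathcal{E}_1^L$ defined on the $2L$ swapped nodes are independent of event $\mathcal{E}_1^{N-L}$ on the $2(N-L)$ unswapped nodes, we have the following, where we omit writing out the "$f$ at random" condition,

$$\begin{aligned}
\Pr[e_0 \mid (h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L) \cap f \in \bar{\mathcal{E}}_1^{N-L}]
&= \frac{\Pr[e_0 \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L] - \Pr[e_0 \mid (h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L) \cap f \in \mathcal{E}_1^{N-L}] \cdot \Pr[f \in \mathcal{E}_1^{N-L}]}{\Pr[f \in \bar{\mathcal{E}}_1^{N-L}]} \\
&\le \frac{\Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L]}{\Pr[f \in \bar{\mathcal{E}}_1^{N-L}]} \le \frac{p_3}{\left(1 - \frac{1}{N^{32}}\right)^{2(N-L)}} \le \frac{p_3}{1 - 2(N-L)/N^{32}},
\end{aligned}$$

where $\Pr[f \in \bar{\mathcal{E}}_1^{N-L}] \ge 1 - 2(N-L)/N^{32}$ following a proof similar to that of Lemma 39. ■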

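Lemma 45 $\Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid \bar{\mathcal{E}}_1^N] \le \frac{p_2}{1 - 2L/N^{32}} + \frac{p_3}{1 - 2(N-L)/N^{32}}$.

Proof By the assumption of independence between node events,

$$\Pr[h \in \mathcal{E}_2^L \mid \bar{\mathcal{E}}_1^N] = \Pr[h \in \mathcal{E}_2^L \mid h \in \bar{\mathcal{E}}_1^L \cap f \in \bar{\mathcal{E}}_1^{N-L}] = \Pr[h \in \mathcal{E}_2^L \mid h \in \bar{\mathcal{E}}_1^L] \le \frac{p_2}{1 - 2L/N^{32}}.$$

When $h \in \mathcal{E}_2^L$, we give up bounding $\mathrm{diff}(T,(S,\bar S),L) < 0$; hence by Lemmas 43 and 44,

$$\begin{aligned}
\Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid \bar{\mathcal{E}}_1^N] \le{}& \Pr[h \in \mathcal{E}_2^L \mid \bar{\mathcal{E}}_1^N] \\
&+ \Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid (h,f) \in \bar{\mathcal{E}}_1^N \cap h \in \bar{\mathcal{E}}_2^L] \cdot \Pr[h \in \bar{\mathcal{E}}_2^L \mid \bar{\mathcal{E}}_1^N] \\
\le{}& \frac{p_2}{1 - 2L/N^{32}} + \frac{p_3}{1 - 2(N-L)/N^{32}}.
\end{aligned}$$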

Finally, we prove Theorem [TJ 
Proof of Theorem Q] 

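$$\begin{aligned}
\Pr[\exists (S,\bar S) \text{ s.t. } \mathrm{score}(S,\bar S) > \mathrm{score}(T)]
&\le \Pr[\mathcal{E}_1^N] + \sum_{(S,\bar S)} \Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid \bar{\mathcal{E}}_1^N] \\
&\le \frac{2}{N^{31}} + \sum_{L} \binom{N}{L}\binom{N}{L}\left(\frac{p_2}{1 - 2L/N^{32}} + \frac{p_3}{1 - 2(N-L)/N^{32}}\right) = O\left(\frac{1}{\mathrm{poly}(N)}\right).
\end{aligned}$$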
Figure 5: Events Relationship in Section 7
Appendix A. Proof of Lemma 42

The following proof has been used in the full proof in Chaudhuri et al. (2007, Theorem 3.1).

Proof of Lemma 42. To facilitate our proof, we obtain a set of nonnegative numbers $(\tilde t_1, \ldots, \tilde t_K)$ as follows: $\forall k$, to obtain $\tilde t_k$, we round $|t_k|$ down to the nearest nonnegative number that is a power of two. It is easy to verify that $\forall k$, $t_k \in \left[\frac{-L - \mathbf{E}[f_2^k(h)]}{\sqrt{L}}, \frac{L - \mathbf{E}[f_2^k(h)]}{\sqrt{L}}\right]$ by Proposition 40. Thus we have $\tilde t_k \le |t_k| \le \sqrt{L} + \frac{|\mathbf{E}[f_2^k(h)]|}{\sqrt{L}}$.

Let us divide the entire range of $|t_k|$ into intervals using power-of-2 nonnegative integers as dividing points. Let $r_k$, $\forall k$, represent the number of such intervals; we have $\forall k$, so long as $N \ge 2$,

$$r_k = \log\left(\sqrt{L} + L(p_1^k - p_2^k)/\sqrt{L}\right) \le \log 2\sqrt{L} \le \log 2\sqrt{N/2} \le \log N. \tag{30}$$

Thus we have at most $(\log N)^K$ blocks in the $K$-dimensional space such that each block along each dimension is a subinterval of $\left[0, \sqrt{L} + \frac{|\mathbf{E}[f_2^k(h)]|}{\sqrt{L}}\right]$. Let $B(\beta_1, \ldots, \beta_K)$ represent a block in the $K$-dimensional space, where $\beta_1, \ldots, \beta_K$ are nonnegative power-of-2 integers and every point in $B(\beta_1, \ldots, \beta_K)$ has its value fixed in interval $[\beta_k, 2\beta_k)$ along dimension $k$, $\forall k$; hence $(\beta_1, \ldots, \beta_K)$ is the point in the $K$-dimensional space with the smallest coordinate in every dimension in $B(\beta_1, \ldots, \beta_K)$.

A set of values $(t_1, \ldots, t_K)$ as in Definition 25 is mapped into one of these blocks uniquely as follows. We say a point $(t_1, \ldots, t_K)$ maps to $B(\beta_1, \ldots, \beta_K)$ if $\forall k$, $2\beta_k > |t_k| \ge \beta_k$, i.e., $(\tilde t_1, \ldots, \tilde t_K) = (\beta_1, \ldots, \beta_K)$. We first bound the following event using Lemma 46. Let us fix one block $B(\beta_1, \ldots, \beta_K)$ for a fixed set of values $\beta_1, \ldots, \beta_K$ such that $\sum_{k=1}^{K} \beta_k^2 \ge \Delta/4$.

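Lemma 46 Let $\Delta/4 = 2N\ln 2 + K(\ln 2)(\log\log N + 1) + (3\ln N)/8$, as $\Delta$ is defined in Definition 26.

$$\Pr\left[h \text{ maps to a fixed } B(\beta_1, \ldots, \beta_K) \text{ s.t. } \sum_{k=1}^{K} \beta_k^2 \ge \Delta/4\right] \le \frac{1}{2^{2N} \cdot (\log N)^K \cdot N^{3/2}}.$$

Proof Let $t_1\sqrt{L}, \ldots, t_K\sqrt{L}$ be the deviations that we observe in $h$ for random variables $f_2^1(h), f_2^2(h), \ldots, f_2^K(h)$ as in Definition 25. If coordinates $(t_1, \ldots, t_K)$ of $h$ map to $(\beta_1, \ldots, \beta_K)$, we know that $\forall k$, $2\beta_k > |t_k| \ge \beta_k$ given the definition of $B(\beta_1, \ldots, \beta_K)$. In addition, by Lemma 41 we know that

$$\Pr\left[\left|f_2^k(h) - \mathbf{E}[f_2^k(h)]\right| \ge \beta_k\sqrt{L}\right] \le 2e^{-\beta_k^2/4}, \tag{31}$$

and events corresponding to different dimensions are independent; thus we have

$$\begin{aligned}
&\Pr\left[h \text{ maps to a particular } B(\beta_1, \ldots, \beta_K) \text{ s.t. } \sum_{k=1}^{K} \beta_k^2 \ge \Delta/4\right] \\
&\quad\le \prod_{k=1}^{K} \Pr\left[2\beta_k\sqrt{L} > \left|f_2^k(h) - \mathbf{E}[f_2^k(h)]\right| = |t_k\sqrt{L}| \ge \beta_k\sqrt{L} \text{ s.t. } \sum_{k=1}^{K} \beta_k^2 \ge \Delta/4\right] && (32) \\
&\quad\le \prod_{k=1}^{K} \Pr\left[\left|f_2^k(h) - \mathbf{E}[f_2^k(h)]\right| \ge \beta_k\sqrt{L} \text{ s.t. } \sum_{k=1}^{K} \beta_k^2 \ge \Delta/4\right] && (33) \\
&\quad\le 2^K e^{-\Delta/16} \le 2^K \exp\left(-\left(2N\ln 2 + K\ln 2\,(\log\log N + 1) + 3\ln N/2\right)\right) \\
&\quad= \frac{2^K}{2^{2N} \cdot (2\log N)^K \cdot N^{3/2}} = \frac{1}{2^{2N} \cdot (\log N)^K \cdot N^{3/2}}. && (34)
\end{aligned}$$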
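Given that $t_k^2 \le 4\tilde t_k^2$, $\forall k$, we know that $\sum_{k=1}^{K} t_k^2 \ge \Delta$ implies that $\sum_{k=1}^{K} \tilde t_k^2 \ge \frac{1}{4}\sum_{k=1}^{K} t_k^2 \ge \Delta/4$. Thus we have

$$\begin{aligned}
\Pr\left[\sum_{k=1}^{K} t_k^2 \ge \Delta\right] &\le \Pr\left[\sum_{k=1}^{K} \tilde t_k^2 \ge \Delta/4\right] && (35) \\
&= \Pr\left[h \text{ maps to some } B(\beta_1, \ldots, \beta_K) \text{ s.t. } \sum_{k=1}^{K} \beta_k^2 \ge \Delta/4\right]. && (36)
\end{aligned}$$

This allows us to upper bound $\Pr[\mathcal{E}_2^L]$ with events regarding $\sum_{k=1}^{K} \tilde t_k^2$ as follows:

$$\begin{aligned}
\Pr[\mathcal{E}_2^L] &= \Pr\left[\bigwedge_{k=1}^{K}\left(f_2^k(h) - \mathbf{E}[f_2^k(h)] = t_k\sqrt{L}\right) \text{ s.t. } \sum_{k=1}^{K} t_k^2 \ge \Delta\right] && (37) \\
&\le \Pr\left[h \text{ maps to some } B(\beta_1, \ldots, \beta_K) \text{ s.t. } \sum_{k=1}^{K} \beta_k^2 \ge \Delta/4\right] \\
&\le (\log N)^K \cdot \frac{1}{2^{2N} \cdot (\log N)^K \cdot N^{3/2}} = \frac{1}{2^{2N}\,\mathrm{poly}(N)}. && (38)
\end{aligned}$$

Hence the probability that the $2KL$ unordered pairs induce simultaneously large deviation for random variables $f_2^1(h), \ldots, f_2^K(h)$, as in Definition 26, is at most $p_2 = O\left(\frac{1}{2^{2N}\,\mathrm{poly}(N)}\right)$. ■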
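The rounding step used in this proof can be sketched in a few lines. The code below is an illustration under stated assumptions: the helper names are hypothetical, the dividing points are taken to be the power-of-2 integers $1, 2, 4, \ldots$, and values with $|t_k| < 1$ are sent to 0. Under that reading, the factor-4 relation $t_k^2 \le 4\tilde t_k^2$ is checked only for $|t_k| \ge 1$.

```python
def round_down_pow2(x):
    """Largest power-of-2 integer not exceeding x; 0 when x < 1 (assumed convention)."""
    if x < 1:
        return 0
    p = 1
    while 2 * p <= x:
        p *= 2
    return p

def block_of(ts):
    """Map deviation values (t_1..t_K) to the corner (beta_1..beta_K) of their block."""
    return tuple(round_down_pow2(abs(t)) for t in ts)

corner = block_of([5.3, -1.0, 0.4, 8.0])
print(corner)

# Each |t| >= 1 lies in [beta, 2*beta), so t^2 <= 4*beta^2.
for t in (1.0, 1.5, 3.99, 100.2):
    beta = round_down_pow2(t)
    print(t, beta, t * t <= 4 * beta * beta)
```

Because each coordinate has at most about $\log N$ possible corners, a union bound over blocks costs only the $(\log N)^K$ factor that appears in (38).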

A.1 Actual Proof of Lemma 28

Note that the constant in the lemma has not been optimized.

Proof of Lemma 28. We take $t = \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)] \ge KL(N-L)\gamma/2$ and plug in Theorem 37; we have the following:

$$\begin{aligned}
&\Pr[\mathrm{diff}(T,(S,\bar S),L) < 0 \mid h \in \bar{\mathcal{E}}_2^L \cap \bar{\mathcal{E}}_1^L] \\
&\quad= \Pr_h\left[\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L) \mid H^{2KN}] - \mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)] < -\mathbf{E}_h[\mathrm{diff}(T,(S,\bar S),L)]\right] \\
&\quad\le 2e^{-t^2/2\sigma^2} \le 2e^{-(KL(N-L)\gamma/2)^2/2\sigma^2},
\end{aligned} \tag{39}$$

where $\sigma^2 \le 4(N-L)L^2(K\gamma) + 4(N-L)L\Delta$ as defined in Theorem 37.

We will prove that for all $N \ge 4$, so long as

1. $K \ge \Omega\left(\frac{\ln N}{\gamma}\right)$,

2. $KN \ge \Omega\left(\frac{\ln N \log\log N}{\gamma^2}\right)$,

we will have

$$2e^{-t^2/2\sigma^2} \le 2e^{-(2KL(N-L)\gamma)^2/2\sigma^2} \le \frac{2}{N^{4L}}. \tag{40}$$

In what follows, we show that given different values of $N$, by choosing slightly different constants in (1) and (2), (40) is always satisfied.
Case 1: $4 \le N \le \log\log N/2\gamma$.

In this case, we require that $KN \ge \frac{c_1 \ln N \log\log N}{\gamma^2}$, where $c_1 \ge 1488$, which immediately implies the following inequalities given that $N \le \log\log N/2\gamma$:

1. $K \ge \frac{2c_1 \ln N}{\gamma}$,

2. $N \le \frac{K \log\log N}{4c_1 \ln N}$,

3. $\log\log N \ge 4\gamma$, $\forall N \ge 4$, i.e., we consider cases where $\gamma$ is small enough,

4. $\ln N \ge 2\ln 2$, $\forall N \ge 4$.

We first derive the following term that appears in $\sigma^2$ as specified in Theorem 37:

$$\begin{aligned}
16L(N-L)(32N\ln 2 + 6\ln N) &\le 512\ln 2\,(N-L)LN + 96(N-L)L\ln N \\
&\le \frac{128\ln 2\,K(N-L)L\log\log N}{c_1 \ln N} + \frac{48\gamma K(N-L)L}{c_1} \\
&\le \frac{64K(N-L)L\log\log N}{c_1} + \frac{12K(N-L)L\log\log N}{c_1} \\
&\le \frac{76K(N-L)L\log\log N}{c_1} \le K(N-L)L\log\log N,
\end{aligned}$$

given that $c_1 \ge 1488$. Next, given that $L\gamma \le N\gamma/2 \le \frac{\log\log N}{4}$, we have

$$\begin{aligned}
\sigma^2 &\le 64K(N-L)L(L\gamma) + 355K(N-L)L\log\log N + KL(N-L)\log\log N \\
&\le 16KL(N-L)\log\log N + 356KL(N-L)\log\log N \\
&\le 372KL(N-L)\log\log N.
\end{aligned}$$

Finally, given that $KN \ge \frac{1488\log\log N \ln N}{\gamma^2}$, we have:

$$2e^{-t^2/2\sigma^2} \le 2e^{-(2KL(N-L)\gamma)^2/2\sigma^2} \le 2e^{-\frac{4KL(N-L)\gamma^2}{2\cdot 372\log\log N}} \le 2e^{-\frac{LKN\gamma^2}{372\log\log N}} \le \frac{2}{N^{4L}}.$$

Thus we also have $K \ge \frac{2c_1 \ln N}{\gamma} = \frac{2976\ln N}{\gamma}$ given that $N \le \log\log N/2\gamma$.
Case 2: $\frac{\log\log N}{2\gamma} \le N \le \frac{K\log\log N}{20}$.

In this case, $K$ and $N$ are close and we require the following,

1. $K \ge \frac{c_2 \ln N}{\gamma}$, where $c_2 = 512$,

2. $KN \ge \frac{c_0 \ln N \log\log N}{\gamma^2}$, where $c_0 = 2000$.

Note that the constants $c_0, c_2$ above are not optimized; given any $N$, an optimal combination of $c_0, c_2$ will result in the lowest possible $K$ given that $K \ge \max\left\{\frac{c_0 \ln N \log\log N}{N\gamma^2}, \frac{c_2 \ln N}{\gamma}\right\}$.

Given that $N \le \frac{K\log\log N}{20}$, we have:
16L(N - L)(32Nln2 + 6hiN) < ^K(N - L)L log log N < 2QK (N - L)L log log N, 



and hence

$$\begin{aligned}
\sigma^2 &\le 64K(N-L)L^2\gamma + 355K(N-L)L\log\log N + 20K(N-L)L\log\log N \\
&\le 64(N-L)L^2K\gamma + 375KL(N-L)\log\log N.
\end{aligned}$$

The following inequalities are due to (1) and (2) respectively,

$$\frac{(2KL(N-L)\gamma)^2}{2 \cdot 64K(N-L)L^2\gamma} \ge 16L\ln N, \tag{41}$$

$$\frac{(2KL(N-L)\gamma)^2}{2 \cdot 375KL(N-L)\log\log N} \ge \frac{16}{3}L\ln N, \tag{42}$$

and thus

$$2\sigma^2 \le \frac{(2KL(N-L)\gamma)^2}{16L\ln N} + \frac{(2KL(N-L)\gamma)^2}{16L\ln N/3} \le \frac{(2KL(N-L)\gamma)^2}{4L\ln N},$$

and $2e^{-t^2/2\sigma^2} \le 2e^{-(2KL(N-L)\gamma)^2/2\sigma^2} \le 2e^{-4L\ln N} \le \frac{2}{N^{4L}}$.

Case 3: $N \ge \frac{K\log\log N}{20} \ge 16$.

Here we require that $K = \frac{c_3 \ln N}{\gamma}$ for some $c_3$ to be determined. Thus we have $KN \ge \frac{c_3^2 \ln^2 N \log\log N}{20\gamma^2}$, which satisfies the constraint of the form $KN \ge \Omega\left(\frac{\ln N \log\log N}{\gamma^2}\right)$ as in other cases.

Given that $N \ge 4$, we have that $\ln N \ge 2\ln 2$ and hence

$$\begin{aligned}
16L(N-L)(32N\ln 2 + 6\ln N) &\le 128(N-L)LN\ln N + 6NL(N-L)\ln N \\
&\le 134(N-L)LN\ln N.
\end{aligned}$$

Given that $K\log\log N \le 20N$, we have:

$$\begin{aligned}
\sigma^2 &\le 64K(N-L)L^2\gamma + 512\ln 2 \cdot (K\log\log N)(N-L)L + 134(N-L)LN\ln N \\
&\le 64(N-L)L^2(K\gamma) + 512\ln 2 \cdot 20N(N-L)L + 134(N-L)LN\ln N \\
&\le 64\left(\frac{c_3 \ln N}{\gamma}\right)\gamma(N-L)L(N/2) + (N-L)LN\ln N\,(128 \cdot 20 + 134) \\
&\le (32c_3 + 2694)(N-L)LN\ln N.
\end{aligned}$$

By taking $c_3 = 188$ such that $c_3^2 \ge 4(32c_3 + 2694)$, we have

$$\begin{aligned}
t^2/2\sigma^2 &\ge \frac{(2K(N-L)L\gamma)^2}{2\sigma^2} = \frac{(2c_3(N-L)L\ln N)^2}{2\sigma^2} \ge \frac{2(c_3(N-L)L\ln N)^2}{(32c_3 + 2694)N(N-L)L\ln N} \\
&\ge \frac{2c_3^2(N-L)L\ln N}{(32c_3 + 2694)N} \ge \frac{c_3^2 L\ln N}{32c_3 + 2694} \ge 4L\ln N.
\end{aligned}$$

Thus $2e^{-t^2/2\sigma^2} \le 2e^{-\frac{c_3^2 L\ln N}{32c_3 + 2694}} \le 2e^{-4L\ln N} = \frac{2}{N^{4L}}$. In summary, we have the following requirements. Note that $N$ always falls into one of these cases. For all cases, we require that $K \ge \Omega(\ln N/\gamma)$ (which is implicit for Case 1); the constant that we require in $K$ for Case 2 is larger than that for Case 3 (i.e., $c_2 \ge c_3$ as above), so that the two cases can overlap.

• Case 1: $16 \le N \le \log\log N/2\gamma$. We require that $KN \ge \frac{1488\ln N\log\log N}{\gamma^2}$, which implies that $K \ge 2976\ln N/\gamma$.

• Case 2: $\frac{\log\log N}{2\gamma} \le N \le \frac{K\log\log N}{20}$. We require that $K \ge \frac{512\ln N}{\gamma}$ and $KN \ge \frac{2000\ln N\log\log N}{\gamma^2}$.

• Case 3: $N \ge \frac{K\log\log N}{20}$. We require $K \ge \frac{188\ln N}{\gamma}$.
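The arithmetic behind the three constants can be checked mechanically. The sketch below is illustrative: the function names are hypothetical, $\log$ is assumed to be base 2, and where the case boundaries overlap the classifier simply returns the first case that applies. Each check mirrors the final step of the corresponding case: $c_1/372 \ge 4$ for Case 1, $c_2/64 \ge 8$ and $2c_0/750 \ge 16/3$ for inequalities (41) and (42), and $c_3^2 \ge 4(32c_3 + 2694)$ for Case 3.

```python
import math
from fractions import Fraction

C0, C1, C2, C3 = 2000, 1488, 512, 188  # constants quoted in the three cases

def constants_consistent():
    """Check the closing arithmetic of each case with exact rationals."""
    return {
        "case1": Fraction(C1, 372) >= 4,
        "case2_eq41": Fraction(C2, 64) >= 8,
        "case2_eq42": Fraction(2 * C0, 750) >= Fraction(16, 3),
        "case3": C3 * C3 >= 4 * (32 * C3 + 2694),
    }

def case_of(N, K, gamma):
    """Return which case (1, 2 or 3) the triple falls into; log base 2 assumed."""
    loglog = math.log2(math.log2(N))
    if N <= loglog / (2 * gamma):
        return 1
    if N <= K * loglog / 20:
        return 2
    return 3

checks = constants_consistent()
print(checks)
print(case_of(16, 10**6, 0.001), case_of(100, 10**4, 0.1), case_of(1000, 100, 0.1))
```

Two of the checks hold with equality ($1488/372 = 4$ and $4000/750 = 16/3$), which is consistent with the remark that the constants have not been optimized beyond making each case close.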